[RFC,00/11] khugepaged: mTHP support

From: Nico Pache <npache@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
    cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
    dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
    jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
    hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
    peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
    ziy@nvidia.com, jglisse@google.com, surenb@google.com,
    vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
    jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
    kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com,
    raquini@redhat.com, dev.jain@arm.com, sunnanyong@huawei.com,
    usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
Subject: [RFC 00/11] khugepaged: mTHP support
Date: Wed, 8 Jan 2025 16:31:16 -0700
Message-ID: <20250108233128.14484-1-npache@redhat.com>

Nico Pache Jan. 8, 2025, 11:31 p.m. UTC

The following series provides khugepaged and madvise collapse with the 
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. max_ptes_none is mapped to a 0-100 range to
determine how full an mTHP order needs to be before collapsing it.
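
To make the 0-100 mapping concrete, here is a minimal userspace sketch of one
way such a scale could work. It is an illustration only: the helper names, the
fixed 512-PTE PMD, and the exact rounding are assumptions and may not match
what the patches do.

```c
#include <assert.h>

#define PMD_ORDER_ASSUMED 9                       /* assumption: 4K pages, 512 PTEs per PMD */
#define PMD_NR_ASSUMED    (1u << PMD_ORDER_ASSUMED)

/*
 * Hypothetical helper: turn max_ptes_none (0 .. PMD_NR_ASSUMED - 1) into the
 * percentage of a region that must be populated before it may be collapsed.
 */
static unsigned int required_fill_pct(unsigned int max_ptes_none)
{
	assert(max_ptes_none < PMD_NR_ASSUMED);
	return (PMD_NR_ASSUMED - max_ptes_none) * 100 / PMD_NR_ASSUMED;
}

/*
 * Hypothetical helper: does a region of the given order, with nr_present
 * populated PTEs, meet the percentage derived from max_ptes_none?
 */
static int order_is_full_enough(unsigned int order, unsigned int nr_present,
				unsigned int max_ptes_none)
{
	unsigned int nr = 1u << order;

	/* Compare nr_present / nr against pct / 100 without dividing. */
	return nr_present * 100 >= required_fill_pct(max_ptes_none) * nr;
}
```

With these assumptions, max_ptes_none=0 demands a completely full region,
max_ptes_none=256 demands half, and the value 511 rounds down to 0 under
integer division, so any region would qualify.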

Some design choices to note: 
 - bitmap structures are allocated dynamically because on some arches
    (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
    compile time, leading to warnings.
 - The recursion is done iteratively through a stack structure.
 - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it is
    64 bits on x86. This provides some optimization on the bitmap operations.
    If other arches/configs that have more than 512 PTEs per PMD want to
    compress their bitmap further we can change this value per arch.
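
The scan-then-split idea described above can be sketched in plain userspace C.
This is not the code from the patches: the constants (512 PTEs per PMD, a
minimum order of 3 so the bitmap fits in 64 bits), the struct, and all function
names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PMD_ORDER   9                        /* assumption: 512 PTEs per PMD */
#define MIN_ORDER   3                        /* assumption: one bit covers 8 PTEs */
#define PMD_NR      (1u << PMD_ORDER)
#define CHUNK_NR    (1u << MIN_ORDER)
#define NR_CHUNKS   (PMD_NR / CHUNK_NR)      /* 64 bits in total */
#define STACK_MAX   (PMD_ORDER - MIN_ORDER + 1)

struct range { unsigned int offset; unsigned int order; };

/* Scan: bit i is set only if every PTE in chunk i is present. */
static uint64_t scan_pmd(const bool present[PMD_NR])
{
	uint64_t bitmap = 0;

	for (unsigned int chunk = 0; chunk < NR_CHUNKS; chunk++) {
		bool full = true;

		for (unsigned int j = 0; j < CHUNK_NR; j++)
			full = full && present[chunk * CHUNK_NR + j];
		if (full)
			bitmap |= UINT64_C(1) << chunk;
	}
	return bitmap;
}

/* Count the set bits in [first, first + nbits). */
static unsigned int bits_set(uint64_t bitmap, unsigned int first,
			     unsigned int nbits)
{
	unsigned int n = 0;

	for (unsigned int i = first; i < first + nbits; i++)
		n += (unsigned int)((bitmap >> i) & 1);
	return n;
}

/*
 * Binary split driven by an explicit stack instead of recursion. A range is
 * reported when at least min_fill_pct percent of its chunks are full;
 * otherwise it is halved until MIN_ORDER is reached. Returns the number of
 * ranges written to out[] (at most NR_CHUNKS).
 */
static size_t pick_orders(uint64_t bitmap, unsigned int min_fill_pct,
			  struct range out[NR_CHUNKS])
{
	struct range stack[STACK_MAX];
	size_t top = 0, n = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };
	while (top > 0) {
		struct range r = stack[--top];
		unsigned int nbits = 1u << (r.order - MIN_ORDER);
		unsigned int set = bits_set(bitmap, r.offset >> MIN_ORDER, nbits);

		if (set == 0)
			continue;                /* nothing populated here */
		if (set * 100 >= min_fill_pct * nbits) {
			out[n++] = r;            /* full enough: collapse candidate */
			continue;
		}
		if (r.order > MIN_ORDER) {
			unsigned int half = 1u << (r.order - 1);

			assert(top + 2 <= STACK_MAX);
			/* Push the right half first so the left half is handled first. */
			stack[top++] = (struct range){ r.offset + half, r.order - 1 };
			stack[top++] = (struct range){ r.offset, r.order - 1 };
		}
	}
	return n;
}
```

In this sketch each pop pushes at most two children, so the stack never holds
more than PMD_ORDER - MIN_ORDER + 1 entries, which is why a small fixed-size
array is enough here.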

Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
Patch 3:    A minor "fix"/optimization
Patch 4:    Refactor/rename hpage_collapse
Patch 5-7:  Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches

This series acts as an alternative to Dev Jain's approach [1]. The two 
series differ in a few ways:
  - My approach uses a bitmap to store the state of the linear scan_pmd to
    then determine potential mTHP batches. Dev incorporates his directly
    into the scan, and will try each available order.
  - Dev is attempting to optimize the locking, while my approach keeps the
    locking changes to a minimum. I believe his changes are not safe for
    uffd.
  - Dev's changes only work for khugepaged, not madvise_collapse (although
    I think that was by choice and it could easily support madvise).
  - Dev scales all khugepaged sysfs tunables by order, while I'm removing
    the restriction of max_ptes_none and converting it to a scale to
    determine a (m)THP threshold.
  - Dev turns on khugepaged if any order is available, while mine still
    only runs if PMDs are enabled. I like Dev's approach and will most
    likely do the same in my PATCH posting.
  - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Patch 11 was inspired by one of Dev's changes.

[1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/

Nico Pache (11):
  introduce khugepaged_collapse_single_pmd to collapse a single pmd
  khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
  khugepaged: Don't allocate khugepaged mm_slot early
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: remove max_ptes_none restriction on the pmd scan
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h |   4 +-
 mm/huge_memory.c           |   3 +-
 mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
 3 files changed, 306 insertions(+), 137 deletions(-)

Dev Jain Jan. 9, 2025, 6:22 a.m. UTC | #1

Before I take a proper look at your series, can you please include any testing
you may have done?

Dev Jain Jan. 9, 2025, 6:27 a.m. UTC | #2

On 09/01/25 5:01 am, Nico Pache wrote:
>    - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Well, I did not miss it :)

int nr_pages = folio_nr_pages(folio);
folio_ref_add(folio, nr_pages - 1);
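
The arithmetic behind those two lines can be checked with a toy model. This is
a userspace sketch, not kernel code: `struct toy_folio` and both helpers are
made up for illustration, and it assumes a freshly allocated folio starts with
a single reference.

```c
#include <assert.h>

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	unsigned int order;
	int refcount;
};

/* Number of base pages covered by the toy folio: 1 << order. */
static int toy_folio_nr_pages(const struct toy_folio *f)
{
	return 1 << f->order;
}

/* Mirror of the snippet above: add nr_pages - 1 references. */
static void toy_collapse_take_refs(struct toy_folio *f)
{
	assert(f->refcount == 1);    /* assumption: one reference after allocation */
	f->refcount += toy_folio_nr_pages(f) - 1;
}
```

Starting from one reference and adding nr_pages - 1 leaves the count at
nr_pages, which is the 1<<order value mentioned in the cover letter.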


Nico Pache Jan. 10, 2025, 1:28 a.m. UTC | #3

On Wed, Jan 8, 2025 at 11:27 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> >    - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>
> Well, I did not miss it :)
Sorry! I missed that in my initial review of your code. Seeing that
would have saved me a few hours of debugging xD

>
> int nr_pages = folio_nr_pages(folio);
> folio_ref_add(folio, nr_pages - 1);

Once I found the fix I forgot to cross reference with your series.
Missing this ref update was causing the issue I alluded to in your RFC
thread. When you said you ran into some issues on the debug configs I
figured it was the same one.



Nico Pache Jan. 10, 2025, 2:27 a.m. UTC | #4

On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> Before I take a proper look at your series, can you please include any testing
> you may have done?

I built these changes for the following arches: x86_64, arm64,
arm64-64k, ppc64le, s390x

x86 testing:
- Selftests mm
- some stress-ng tests
- compile kernel
- I did some tests with my defer [1] series applied on top. This pushes
all the work to khugepaged, which removes the noise of all the PF
allocations.

I recently got an ARM64 machine and did some simple sanity tests (on
both 4k and 64k) like selftests, stress-ng, and playing around with
the tunables, etc.

I will also be running all the builds through our CI, and perf testing
environments before posting.

[1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/

>

Dev Jain Jan. 10, 2025, 4:56 a.m. UTC | #5

On 10/01/25 7:57 am, Nico Pache wrote:
> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
>> Before I take a proper look at your series, can you please include any testing
>> you may have done?
> 
> I Built these changes for the following arches: x86_64, arm64,
> arm64-64k, ppc64le, s390x
> 
> x86 testing:
> - Selftests mm
> - some stress-ng tests
> - compile kernel
> - I did some tests with my defer [1] set on top. This pushes all the
> work to khugepaged, which removes the noise of all the PF allocations.
> 
> I recently got an ARM64 machine and did some simple sanity tests (on
> both 4k and 64k) like selftests, stress-ng, and playing around with
> the tunables, etc.
> 
> I will also be running all the builds through our CI, and perf testing
> environments before posting.
> 
> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> 
>>
> 
I tested your series with the program I was using and it is not working;
can you please confirm?

diff --git a/mytests/mthp.c b/mytests/mthp.c
new file mode 100644
index 000000000000..e3029dbcf035
--- /dev/null
+++ b/mytests/mthp.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * Author: Dev Jain <dev.jain@arm.com>
+ *
+ * Program to test khugepaged mTHP collapse
+ */
+
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/random.h>
+#include <assert.h>
+
+int main(int argc, char *argv[])
+{
+	char *ptr;
+	unsigned long mthp_size = (1UL << 16);
+	size_t chunk_size = (1UL << 25);
+
+	ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (((unsigned long)ptr) != (1UL << 30)) {
+		printf("mmap did not work on required address\n");
+		return 1;
+	}
+
+	/* Fill first pte in every 64K interval */
+	for (int i = 0; i < chunk_size; i += mthp_size)
+		ptr[i] = i;
+
+	if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
+		perror("madvise");
+		return 1;
+	}
+	sleep(100);
+	return 0;
+}
