[RFC,00/11] khugepaged: mTHP support

Message ID: 20250108233128.14484-1-npache@redhat.com

Message

Nico Pache Jan. 8, 2025, 11:31 p.m. UTC
The following series provides khugepaged and madvise collapse with the 
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. max_ptes_none is mapped to a 0-100 range to
determine how full a region needs to be before collapsing it to a given
mTHP order.
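
Roughly, that mapping could look like the following (a simplified sketch;
the helper name and the exact formula are illustrative, not necessarily the
series' final code):

    /*
     * Illustrative only: derive how many PTEs of an order-N region must
     * be present before it is worth collapsing, based on max_ptes_none.
     */
    #define PTES_PER_PMD 512     /* x86-64 with 4K pages: 512 PTEs per PMD */

    static unsigned int min_present_ptes(unsigned int order,
                                         unsigned int max_ptes_none)
    {
            /* percentage of the region that must be populated (0-100) */
            unsigned int full_pct = (PTES_PER_PMD - max_ptes_none) * 100 /
                                    PTES_PER_PMD;

            /* an order-N collapse covers 1 << order PTEs; round up */
            return ((1u << order) * full_pct + 99) / 100;
    }

In this sketch the default max_ptes_none of 511 maps to a threshold of 0
(any region qualifies), while max_ptes_none = 0 requires a fully populated
region.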

Some design choices to note:
 - Bitmap structures are allocated dynamically because on some architectures
    (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
    compile time, leading to warnings.
 - The recursion is implemented iteratively via a stack structure (see the
    sketch after this list).
 - An MTHP_MIN_ORDER was added to compress the bitmap and to ensure it is
    64 bits on x86. This provides some optimization of the bitmap operations.
    If other arches/configs with more than 512 PTEs per PMD want to
    compress their bitmap further, we can change this value per arch.
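
A simplified sketch of the bitmap walk (the names, the fixed 64-bit word,
and MTHP_MIN_ORDER == 3 are illustrative assumptions; as noted above, the
real bitmap is allocated dynamically, and order-enablement checks are
omitted):

    #include <stdint.h>

    #define MTHP_MIN_ORDER  3                                 /* assumed */
    #define PTES_PER_PMD    512                               /* x86-64 */
    #define BITMAP_BITS     (PTES_PER_PMD >> MTHP_MIN_ORDER)  /* 64 bits */

    struct range { int offset, nbits; };        /* region, in bitmap bits */

    static int bits_set(uint64_t bm, int offset, int nbits)
    {
            uint64_t mask = (nbits >= 64) ? ~0ULL :
                            ((1ULL << nbits) - 1) << offset;

            return __builtin_popcountll(bm & mask);
    }

    /* Walk the bitmap top-down; split any region that is not full enough. */
    static void scan_bitmap(uint64_t bm, int required_pct)
    {
            struct range stack[16];           /* max depth is log2(64) + 1 */
            int top = 0;

            stack[top++] = (struct range){ 0, BITMAP_BITS };   /* whole PMD */
            while (top) {
                    struct range r = stack[--top];

                    if (bits_set(bm, r.offset, r.nbits) * 100 >=
                        r.nbits * required_pct) {
                            /* collapse r to order MTHP_MIN_ORDER + log2(r.nbits) */
                    } else if (r.nbits > 1) {
                            /* not full enough: recurse into the two halves */
                            stack[top++] = (struct range){ r.offset,
                                                           r.nbits / 2 };
                            stack[top++] = (struct range){ r.offset + r.nbits / 2,
                                                           r.nbits / 2 };
                    }
            }
    }

Here required_pct would come from the max_ptes_none mapping sketched earlier.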

Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
Patch 3:    A minor "fix"/optimization
Patch 4:    Refactor/rename hpage_collapse
Patch 5-7:  Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches

This series acts as an alternative to Dev Jain's approach [1]. The two
series differ in a few ways:
  - My approach uses a bitmap to store the state of the linear scan_pmd and
    then determines potential mTHP batches from it. Dev incorporates this
    directly into the scan and tries each available order.
  - Dev is attempting to optimize the locking, while my approach keeps the
    locking changes to a minimum. I believe his changes are not safe for
    uffd.
  - Dev's changes only work for khugepaged, not madvise_collapse (although
    I think that was by choice and it could easily support madvise).
  - Dev scales all khugepaged sysfs tunables by order, while I'm removing
    the restriction of max_ptes_none and converting it to a scale to
    determine a (m)THP threshold.
  - Dev turns on khugepaged if any order is available, while mine still
    only runs if PMDs are enabled. I like Dev's approach and will most
    likely do the same in my PATCH posting.
  - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Patch 11 was inspired by one of Dev's changes.

[1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/

Nico Pache (11):
  introduce khugepaged_collapse_single_pmd to collapse a single pmd
  khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
  khugepaged: Don't allocate khugepaged mm_slot early
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: remove max_ptes_none restriction on the pmd scan
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h |   4 +-
 mm/huge_memory.c           |   3 +-
 mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
 3 files changed, 306 insertions(+), 137 deletions(-)

Comments

Dev Jain Jan. 9, 2025, 6:22 a.m. UTC | #1
On 09/01/25 5:01 am, Nico Pache wrote:
> [...]

Before I take a proper look at your series, can you please include any testing
you may have done?
Dev Jain Jan. 9, 2025, 6:27 a.m. UTC | #2
On 09/01/25 5:01 am, Nico Pache wrote:
> [...]
>    - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Well, I did not miss it :)

int nr_pages = folio_nr_pages(folio);
folio_ref_add(folio, nr_pages - 1);
