
[RFC,00/14] mm: userspace hugepage collapse

Message ID 20220308213417.1407042-1-zokeefe@google.com (mailing list archive)

Zach O'Keefe March 8, 2022, 9:34 p.m. UTC
Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was previously introduced by David Rientjes, and thanks to
everyone for your patience while I prepared these patches resulting from
that discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages. The default gfp
        flags used will be the same as those used at-fault for the VMA
        region(s) covered. When multiple VMA regions are spanned, if
        faulting-in memory from any VMA would permit synchronous
        compaction and reclaim, then all hugepage allocations required
        to satisfy the request may enter compaction and reclaim.
        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.
        Two flags are defined to control collapse semantics, passed
        through process_madvise(2)'s optional flags parameter:

        MADV_F_COLLAPSE_LIMITS

        If supplied, collapse respects the pte collapse limits set via
        sysfs:
        /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
        Required when calling on behalf of another process without
        CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

        If supplied, permit synchronous compaction and reclaim,
        regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags
        passed: pte collapse limits are ignored, and the gfp flags will
        be the same as those used at-fault for the VMA region(s)
        covered. Note that users wanting different collapse semantics
        can always use process_madvise(2) on themselves.

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory with hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation is to limit contention on
mmap_lock by performing multiple page table modifications while the
lock is held exclusively.

Only private anonymous memory is supported by this series. File-backed
memory support will be added later.

Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
was not considered at this time, but could be added through the flags
parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/

Sequence of Patches
--------------------------------

Patches 1-10 refactor the collapse logic within khugepaged.c,
introducing the notion of a collapse context and isolating logic that
can be reused later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 add the core collapse logic: the former introduces the overall
batched approach and locking strategy, and the latter fills in the
batch action details. This separation was purely to keep patch size
down. Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)

Comments

Zi Yan March 21, 2022, 2:32 p.m. UTC | #1
On 8 Mar 2022, at 16:34, Zach O'Keefe wrote:

> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and

Can we have a better name instead of MADV_COLLAPSE? It sounds like it is
destroying a huge page but is in fact doing the opposite. Something like
MADV_CREATE_HUGE_PAGE? I know the kernel functions use collapse
everywhere, but it might be better not to confuse the user.

Thanks.

--
Best Regards,
Yan, Zi
Michal Hocko March 21, 2022, 2:37 p.m. UTC | #2
[ Removed  Richard Henderson from the CC list as the delivery fails for
  his address]
On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> Introduction
> --------------------------------
> 
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
> 
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
> 
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> 
> Interface
> --------------------------------
> 
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
> 
> (*) process_madvise(2)
> 
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered.

Could you expand on reasoning here? The default allocation mode for #PF
is rather light. Madvised will try harder. The reasoning is that we want
to make stalls due to #PF as small as possible and only try harder for
madvised areas (also a subject of configuration). Wouldn't it make more
sense to try harder for an explicit calls like madvise?

>	  When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.

I am not sure I follow here. Let's have a memory range spanning two
vmas, one with MADV_HUGEPAGE.

>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:

This part is discussed later in the thread.

> 
>         MADV_F_COLLAPSE_LIMITS
> 
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
> 
>         MADV_F_COLLAPSE_DEFRAG
> 
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.

Why do we need this?
Zach O'Keefe March 21, 2022, 2:51 p.m. UTC | #3
On Mon, Mar 21, 2022 at 7:32 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Mar 2022, at 16:34, Zach O'Keefe wrote:
>
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
>
> Can we have a better name instead of MADV_COLLAPSE? It sounds like it is
> destroying a huge page but in fact doing the opposite. Something like
> MADV_CREATE_HUGE_PAGE? I know the kernel functions uses collapse everywhere
> but it might be better not to confuse the user.
>

Hey Zi, thanks for reviewing / commenting. I briefly thought about
"coalesce", but, "collapse" isn't just used within the kernel; it's
already part of existing user apis such as the thp sysfs interface
(/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed),
vmstat (ex /proc/vmstat:thp_collapse_alloc[_failed]), per-memcg stats
(memory.stat:thp_collapse_alloc) and tracepoints (ex
mm_collapse_huge_page). I'm not married to it though.


> Thanks.
>
> --
> Best Regards,
> Yan, Zi
Zach O'Keefe March 21, 2022, 3:46 p.m. UTC | #4
Hey Michal, thanks for taking the time to review / comment.

On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [ Removed  Richard Henderson from the CC list as the delivery fails for
>   his address]

Thank you :)

> On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > leverages the new process_madvise(2) call.
> >
> > (*) process_madvise(2)
> >
> >         Performs a synchronous collapse of the native pages mapped by
> >         the list of iovecs into transparent hugepages. The default gfp
> >         flags used will be the same as those used at-fault for the VMA
> >         region(s) covered.
>
> Could you expand on reasoning here? The default allocation mode for #PF
> is rather light. Madvised will try harder. The reasoning is that we want
> to make stalls due to #PF as small as possible and only try harder for
> madvised areas (also a subject of configuration). Wouldn't it make more
> sense to try harder for an explicit calls like madvise?
>

The reasoning is that the user has presumably configured system/vmas
to tell the kernel how badly they want thps, and so this call aligns
with current expectations. I.e. a user who goes to the trouble of
trying to fault-in a thp at a given memory address likely wants a thp
"as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
thp.

If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
used to explicitly request the kernel to try harder, as you mention.

> >         When multiple VMA regions are spanned, if
> >         faulting-in memory from any VMA would permit synchronous
> >         compaction and reclaim, then all hugepage allocations required
> >         to satisfy the request may enter compaction and reclaim.
>
> I am not sure I follow here. Let's have a memory range spanning two
> vmas, one with MADV_HUGEPAGE.

I think you are rightly confused here, since the code doesn't
currently match this description - thanks for pointing it out.

The idea was that, in the case you provided, the gfp flags used for
all thp allocations would match those used for a MADV_HUGEPAGE vma,
under current system settings. IOW, we treat the semantics of the
collapse for the entire range uniformly (aside from MADV_NOHUGEPAGE,
as per earlier discussions).

So, for example, if transparent_hugepage/enabled was set to "always"
and transparent_hugepage/defrag was set to "madvise", then all
allocations could enter direct reclaim. The reasoning for this is, #1
the user has already told us that entering direct reclaim is tolerable
for this syscall, and they can wait. #2 is that MADV_COLLAPSE might
yield confusing results otherwise; some ranges might get backed by
thps, while others may not. Also, a single MADV_HUGEPAGE vma early in
the range might permit enough reclaim/compaction that allows
successive non-MADV_HUGEPAGE allocations to succeed where they
otherwise may not have.

However, the code and this description disagree, since madvise
decomposes the call over multiple vmas into iterative
madvise_vma_behavior() over a single vma, with no state shared between
calls. If the motivation above is sufficient, then this could be
added.

>
> >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> >         by default, as the user is explicitly requesting this action.
> >         Define two flags to control collapse semantics, passed through
> >         process_madvise(2)’s optional flags parameter:
>
> This part is discussed later in the thread.
>
> >
> >         MADV_F_COLLAPSE_LIMITS
> >
> >         If supplied, collapse respects pte collapse limits set via
> >         sysfs:
> >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> >         Required if calling on behalf of another process and not
> >         CAP_SYS_ADMIN.
> >
> >         MADV_F_COLLAPSE_DEFRAG
> >
> >         If supplied, permit synchronous compaction and reclaim,
> >         regardless of VMA flags.
>
> Why do we need this?

Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?

* MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
inter-process protection for collapsing memory in another process'
address space (which a malevolent program could exploit to cause oom
conditions in another memcg hierarchy, for example), but we want
privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
utilization as they wish.

* MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
to explicitly tell the kernel to try harder to back this by thps,
regardless of the current system/vma configuration.

Note that when used together, these flags can be used to implement the
exact behavior of khugepaged, through MADV_COLLAPSE.

> --
> Michal Hocko
> SUSE Labs
Zach O'Keefe March 22, 2022, 6:40 a.m. UTC | #5
On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
>
> (*) process_madvise(2)
>
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered. When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.
>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:
>
>         MADV_F_COLLAPSE_LIMITS
>
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
>
>         MADV_F_COLLAPSE_DEFRAG
>
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.
>
> (*) madvise(2)
>
>         Equivalent to process_madvise(2) on self, with no flags
>         passed; pte collapse limits are ignored, and the gfp flags will
>         be the same as those used at-fault for the VMA region(s)
>         covered. Note that, users wanting different collapse semantics
>         can always use process_madvise(2) on themselves.
>
> Discussion
> --------------------------------
>
> The mechanism is fully compatible with khugepaged, allowing userspace to
> separately define synchronous and asynchronous hugepage policies, as
> priority dictates. It also naturally permits a DAMON scheme,
> DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
> system by backing the most frequently accessed memory by hugepages[2].
> Though not required to justify this series, hugepage management could be
> offloaded entirely to a sufficiently informed userspace agent,
> supplanting the need for khugepaged in the kernel.
>
> Along with the interface, this series proposes a batched implementation
> to collapse a range of memory. The motivation for this is to limit
> contention on mmap_lock, doing multiple page table modifications while
> the lock is held exclusively.
>
> Only private anonymous memory is supported by this series. File-backed
> memory support will be added later.
>
> Multiple hugepages support (such as 1 GiB gigantic hugepages) were not
> considered at this time, but could be supported by the flags parameter
> in the future.
>
> kselftests were omitted from this series for brevity, but would be
> included in an eventual patch submission.
>
> [2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
>
> Sequence of Patches
> --------------------------------
>
> Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
> introducing the notion of a collapse context and isolating logic that
> can be reused later in the series for the madvise collapse context.
>
> Patches 11-14 introduce logic for the proposed madvise collapse
> mechanism. Patch 11 adds madvise and header file plumbing. Patch 12 and
> 13, separately, add the core collapse logic, with the former introducing
> the overall batched approach and locking strategy, and the latter
> fills-in batch action details. This separation was purely to keep patch
> size down. Patch 14 adds process_madvise support.
>
> Applies against next-20220308.
>
> Zach O'Keefe (14):
>   mm/rmap: add mm_find_pmd_raw helper
>   mm/khugepaged: add struct collapse_control
>   mm/khugepaged: add __do_collapse_huge_page() helper
>   mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
>   mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
>   mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
>   mm/khugepaged: add vm_flags_ignore to
>     hugepage_vma_revalidate_pmd_count()
>   mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
>   mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
>   mm/khugepaged: rename khugepaged-specific/not functions
>   mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
>   mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
>   mm/madvise: add __madvise_collapse_*_batch() actions.
>   mm/madvise: add process_madvise(MADV_COLLAPSE)
>
>  fs/io_uring.c                          |   3 +-
>  include/linux/huge_mm.h                |  27 +-
>  include/linux/mm.h                     |   3 +-
>  include/uapi/asm-generic/mman-common.h |  10 +
>  mm/huge_memory.c                       |   2 +-
>  mm/internal.h                          |   1 +
>  mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
>  mm/madvise.c                           |  45 +-
>  mm/memory.c                            |   6 +-
>  mm/rmap.c                              |  15 +-
>  10 files changed, 842 insertions(+), 207 deletions(-)
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks to the many people who took the time to review and provide
feedback on this series.

In preparation of a V1 PATCH series which will incorporate the
feedback received here, one item I'd specifically like feedback from
the community on is whether support for privately-mapped anonymous
memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
with file-backed support coming later. I have local patches to support
file-backed memory, but my thought was to keep the series no longer
than necessary, for the consideration of reviewers. No substantial
infrastructure changes are required to support file-backed memory; it
naturally builds on top of the existing series (as it was developed
with file-backed support fleshed out).

Thanks,
Zach
Michal Hocko March 22, 2022, 12:05 p.m. UTC | #6
On Mon 21-03-22 23:40:39, Zach O'Keefe wrote:
[...]
> In preparation of a V1 PATCH series which will incorporate the
> feedback received here, one item I'd specifically like feedback from
> the community on is whether support for privately-mapped anonymous
> memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
> with file-backed support coming later.

Yes I think this should be sufficient for the initial implementation.

> I have local patches to support
> file-backed memory, but my thought was to keep the series no longer
> than necessary, for the consideration of reviewers.

Agreed! I think we should focus on the semantic of the anonymous memory
first.
Michal Hocko March 22, 2022, 12:11 p.m. UTC | #7
On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> Hey Michal, thanks for taking the time to review / comment.
> 
> On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [ Removed  Richard Henderson from the CC list as the delivery fails for
> >   his address]
> 
> Thank you :)
> 
> > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > Introduction
> > > --------------------------------
> > >
> > > This series provides a mechanism for userspace to induce a collapse of
> > > eligible ranges of memory into transparent hugepages in process context,
> > > thus permitting users to more tightly control their own hugepage
> > > utilization policy at their own expense.
> > >
> > > This idea was previously introduced by David Rientjes, and thanks to
> > > everyone for your patience while I prepared these patches resulting from
> > > that discussion[1].
> > >
> > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > >
> > > Interface
> > > --------------------------------
> > >
> > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > leverages the new process_madvise(2) call.
> > >
> > > (*) process_madvise(2)
> > >
> > >         Performs a synchronous collapse of the native pages mapped by
> > >         the list of iovecs into transparent hugepages. The default gfp
> > >         flags used will be the same as those used at-fault for the VMA
> > >         region(s) covered.
> >
> > Could you expand on reasoning here? The default allocation mode for #PF
> > is rather light. Madvised will try harder. The reasoning is that we want
> > to make stalls due to #PF as small as possible and only try harder for
> > madvised areas (also a subject of configuration). Wouldn't it make more
> > sense to try harder for an explicit calls like madvise?
> >
> 
> The reasoning is that the user has presumably configured system/vmas
> to tell the kernel how badly they want thps, and so this call aligns
> with current expectations. I.e. a user who goes about the trouble of
> trying to fault-in a thp at a given memory address likely wants a thp
> "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> thp.

If the syscall tries only as hard as the #PF doesn't that limit the
functionality? I mean a non #PF can consume more resources to allocate
and collapse a THP as it won't inflict any measurable latency on the
targeted process (except for potential CPU contention). From that
perspective madvise is much more similar to khugepaged. I would even
argue that it could try even harder because madvise is focused on a very
specific memory range and the execution is not shared among all
processes that are scanned by khugepaged.

> If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> used to explicitly request the kernel to try harder, as you mention.

Do we really need that? How many do_harder levels do we want to support?

What would be typical usecases for #PF based and DEFRAG usages?

[...]

> > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > >         by default, as the user is explicitly requesting this action.
> > >         Define two flags to control collapse semantics, passed through
> > >         process_madvise(2)’s optional flags parameter:
> >
> > This part is discussed later in the thread.
> >
> > >
> > >         MADV_F_COLLAPSE_LIMITS
> > >
> > >         If supplied, collapse respects pte collapse limits set via
> > >         sysfs:
> > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > >         Required if calling on behalf of another process and not
> > >         CAP_SYS_ADMIN.
> > >
> > >         MADV_F_COLLAPSE_DEFRAG
> > >
> > >         If supplied, permit synchronous compaction and reclaim,
> > >         regardless of VMA flags.
> >
> > Why do we need this?
> 
> Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> 
> * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> inter-process protection for collapsing memory in another process'
> address space (which a malevolent program could exploit to cause oom
> conditions in another memcg hierarchy, for example), but we want
> privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> utilization as they wish.

Could you expand some more please? How is this any different from
khugepaged (well, except that you can trigger the collapsing explicitly
rather than rely on khugepaged to find that mm)?

> * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> to explicitly tell the kernel to try harder to back this by thps,
> regardless of the current system/vma configuration.
> 
> Note that when used together, these flags can be used to implement the
> exact behavior of khugepaged, through MADV_COLLAPSE.

IMHO this is stretching the interface and this can backfire in the
future. The interface should be really trivial. I want to collapse a
memory area. Let the kernel do the right thing and do not bother with
all the implementation details. I would use the same allocation strategy
as khugepaged, as this seems to be closest from the latency and
application awareness POV. In a way you can look at the madvise call as
a way to trigger khugepaged functionality on the particular memory range.
Zach O'Keefe March 22, 2022, 3:53 p.m. UTC | #8
On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > Hey Michal, thanks for taking the time to review / comment.
> >
> > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [ Removed  Richard Henderson from the CC list as the delivery fails for
> > >   his address]
> >
> > Thank you :)
> >
> > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > Introduction
> > > > --------------------------------
> > > >
> > > > This series provides a mechanism for userspace to induce a collapse of
> > > > eligible ranges of memory into transparent hugepages in process context,
> > > > thus permitting users to more tightly control their own hugepage
> > > > utilization policy at their own expense.
> > > >
> > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > everyone for your patience while I prepared these patches resulting from
> > > > that discussion[1].
> > > >
> > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > > >
> > > > Interface
> > > > --------------------------------
> > > >
> > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > leverages the new process_madvise(2) call.
> > > >
> > > > (*) process_madvise(2)
> > > >
> > > >         Performs a synchronous collapse of the native pages mapped by
> > > >         the list of iovecs into transparent hugepages. The default gfp
> > > >         flags used will be the same as those used at-fault for the VMA
> > > >         region(s) covered.
> > >
> > > Could you expand on reasoning here? The default allocation mode for #PF
> > > is rather light. Madvised will try harder. The reasoning is that we want
> > > to make stalls due to #PF as small as possible and only try harder for
> > > madvised areas (also a subject of configuration). Wouldn't it make more
> > > sense to try harder for an explicit call like madvise?
> > >
> >
> > The reasoning is that the user has presumably configured system/vmas
> > to tell the kernel how badly they want thps, and so this call aligns
> > with current expectations. I.e. a user who goes to the trouble of
> > trying to fault-in a thp at a given memory address likely wants a thp
> > "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> > thp.
>
> If the syscall tries only as hard as the #PF doesn't that limit the
> functionality?

I'd argue that the various allocation semantics possible through
existing thp knobs / vma flags, in addition to the proposed
MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
work with. Relatively speaking, in what way would we be lacking
functionality?

> I mean a non #PF can consume more resources to allocate
> and collapse a THP as it won't inflict any measurable latency to the
> targeting process (except for potential CPU contention).

Sorry, I'm not sure I understand this. What latency are we discussing
in this point? Do you mean to say that since MADV_COLLAPSE isn't in
the fault path, it doesn't necessarily need to be fast / direct
reclaim wouldn't be noticed?

> From that
> perspective madvise is much more similar to khugepaged. I would even
> argue that it could try even harder because madvise is focused on a very
> specific memory range and the execution is not shared among all
> processes that are scanned by khugepaged.
>

Good point. Covered at the end.

> > If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> > used to explicitly request the kernel to try harder, as you mention.
>
> Do we really need that? How many do_harder levels do we want to support?
>
> What would be typical usecases for #PF based and DEFRAG usages?
>

Thanks for challenging this. Covered at the end.

> [...]
>
> > > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > > >         by default, as the user is explicitly requesting this action.
> > > >         Define two flags to control collapse semantics, passed through
> > > >         process_madvise(2)’s optional flags parameter:
> > >
> > > This part is discussed later in the thread.
> > >
> > > >
> > > >         MADV_F_COLLAPSE_LIMITS
> > > >
> > > >         If supplied, collapse respects pte collapse limits set via
> > > >         sysfs:
> > > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > > >         Required if calling on behalf of another process and not
> > > >         CAP_SYS_ADMIN.
> > > >
> > > >         MADV_F_COLLAPSE_DEFRAG
> > > >
> > > >         If supplied, permit synchronous compaction and reclaim,
> > > >         regardless of VMA flags.
> > >
> > > Why do we need this?
> >
> > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> >
> > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > inter-process protection for collapsing memory in another process'
> > address space (which a malevolent program could exploit to cause oom
> > conditions in another memcg hierarchy, for example), but we want
> > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > utilization as they wish.
>
> Could you expand some more please? How is this any different from
> khugepaged (well, except that you can trigger the collapsing explicitly
> rather than rely on khugepaged to find that mm)?
>

MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
extend khugepaged in userspace, where the benefit is precisely that we
can choose that mm/vma more intelligently.

> > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > to explicitly tell the kernel to try harder to back this by thps,
> > regardless of the current system/vma configuration.
> >
> > Note that when used together, these flags can be used to implement the
> > exact behavior of khugepaged, through MADV_COLLAPSE.
>
> IMHO this is stretching the interface and this can backfire in the
> future. The interface should be really trivial. I want to collapse a
> memory area. Let the kernel do the right thing and do not bother with
> all the implementation details. I would use the same allocation strategy
> > as khugepaged as this seems to be the closest from the latency and
> > application awareness POV. In a way you can look at the madvise call as
> > a way to trigger khugepaged functionality on the particular memory range.

Trying to summarize a few earlier comments centering around
MADV_F_COLLAPSE_DEFRAG and allocation semantics.

This series presupposes the existence of an informed userspace agent
that is aware of what processes/memory ranges would benefit most from
thps. Such an agent might either be:
(1) A system-level daemon optimizing thp utilization system-wide
(2) A highly tuned process / malloc implementation optimizing their
own thp usage

The different types of agents reflect the divide between #PF and
DEFRAG semantics.

For (1), we want to view this exactly like triggering khugepaged
functionality from userspace, and likely want DEFRAG semantics.

For (2), I was viewing this as the "live" symmetric counterpart to
at-fault thp allocation where the process has decided, at runtime,
that this memory could benefit from thp backing, and so #PF semantics
seemed like a sane default. I'd worry that using DEFRAG semantics by
default might deter adoption by users who might not be willing to wait
an unbounded amount of time for direct reclaim.


> --
> Michal Hocko
> SUSE Labs
Zach O'Keefe March 23, 2022, 1:30 p.m. UTC | #9
On Tue, Mar 22, 2022 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-03-22 23:40:39, Zach O'Keefe wrote:
> [...]
> > In preparation of a V1 PATCH series which will incorporate the
> > feedback received here, one item I'd specifically like feedback from
> > the community on is whether support for privately-mapped anonymous
> > memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
> > with file-backed support coming later.
>
> Yes I think this should be sufficient for the initial implementation.
>
> > I have local patches to support
> > file-backed memory, but my thought was to keep the series no longer
> > than necessary, for the consideration of reviewers.
>
> Agreed! I think we should focus on the semantic of the anonymous memory
> first.

Great! Sounds good to me and thanks again for the review & feedback.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko March 29, 2022, 12:24 p.m. UTC | #10
On Tue 22-03-22 08:53:35, Zach O'Keefe wrote:
> [...]
> >
> > If the syscall tries only as hard as the #PF doesn't that limit the
> > functionality?
> 
> I'd argue that the various allocation semantics possible through
> existing thp knobs / vma flags, in addition to the proposed
> MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
> work with. Relatively speaking, in what way would we be lacking
> functionality?

Flexibility is definitely a plus but look at our existing configuration
space and try to wrap your head around that.

> > I mean a non #PF can consume more resources to allocate
> > and collapse a THP as it won't inflict any measurable latency to the
> > targeting process (except for potential CPU contention).
> 
> Sorry, I'm not sure I understand this. What latency are we discussing
> in this point? Do you mean to say that since MADV_COLLAPSE isn't in
> the fault path, it doesn't necessarily need to be fast / direct
> reclaim wouldn't be noticed?

Exactly. Same as khugepaged. I would even argue that khugepaged and
madvise would better behave consistently because in both cases it is a
remote operation to create THPs. One triggered automatically the other
explicitly requested by the userspace. Having a third mode (for madvise)
would add more to the configuration space and thus more complexity.
[...]
> > > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> > >
> > > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > > inter-process protection for collapsing memory in another process'
> > > address space (which a malevolent program could exploit to cause oom
> > > conditions in another memcg hierarchy, for example), but we want
> > > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > > utilization as they wish.
> >
> > Could you expand some more please? How is this any different from
> > khugepaged (well, except that you can trigger the collapsing explicitly
> > rather than rely on khugepaged to find that mm)?
> >
> 
> MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
> extend khugepaged in userspace, where the benefit is precisely that we
> can choose that mm/vma more intelligently.

Could you elaborate some more?

> > > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > > to explicitly tell the kernel to try harder to back this by thps,
> > > regardless of the current system/vma configuration.
> > >
> > > Note that when used together, these flags can be used to implement the
> > > exact behavior of khugepaged, through MADV_COLLAPSE.
> >
> > IMHO this is stretching the interface and this can backfire in the
> > future. The interface should be really trivial. I want to collapse a
> > memory area. Let the kernel do the right thing and do not bother with
> > all the implementation details. I would use the same allocation strategy
> > as khugepaged as this seems to be the closest from the latency and
> > application awareness POV. In a way you can look at the madvise call as
> > a way to trigger khugepaged functionality on the particular memory range.
> 
> Trying to summarize a few earlier comments centering around
> MADV_F_COLLAPSE_DEFRAG and allocation semantics.
> 
> This series presupposes the existence of an informed userspace agent
> that is aware of what processes/memory ranges would benefit most from
> thps. Such an agent might either be:
> (1) A system-level daemon optimizing thp utilization system-wide
> (2) A highly tuned process / malloc implementation optimizing their
> own thp usage
> 
> The different types of agents reflect the divide between #PF and
> DEFRAG semantics.
> 
> For (1), we want to view this exactly like triggering khugepaged
> functionality from userspace, and likely want DEFRAG semantics.
> 
> For (2), I was viewing this as the "live" symmetric counterpart to
> at-fault thp allocation where the process has decided, at runtime,
> that this memory could benefit from thp backing, and so #PF semantics
> seemed like a sane default. I'd worry that using DEFRAG semantics by
> default might deter adoption by users who might not be willing to wait
> an unbounded amount of time for direct reclaim.

This time is not really unbounded. THP even in the defrag mode doesn't
try as hard as e.g. hugetlb allocations.

For your 2) category I am not really sure I see the point. Why would
you want to rely on madvise in a lightweight allocation mode when this
has already been done at #PF time? If an application really
knows it wants to use THP then madvise(MADV_HUGEPAGE) would be the first
thing to do. This would already tell #PF to try a bit harder in some
configurations and khugepaged knows that collapsing memory makes sense.

That being said I would be really careful to provide an extended
interface to control how hard to try to allocate a THP. This has a high
risk of externalizing internal implementation details about how the
compaction works. Unless we have a strong real life usecase I would go
with the khugepaged semantic initially. Maybe we will learn about future
usecases where a very lightweight allocation mode is required but that
can be added later on. The simpler the interface is initially the
better.

Thanks!
Zach O'Keefe March 30, 2022, 12:36 a.m. UTC | #11
Hey Michal,

Thanks again for taking the time to discuss and align on this last point.

On Tue, Mar 29, 2022 at 5:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-03-22 08:53:35, Zach O'Keefe wrote:
> > [...]
> > >
> > > If the syscall tries only as hard as the #PF doesn't that limit the
> > > functionality?
> >
> > I'd argue that the various allocation semantics possible through
> > existing thp knobs / vma flags, in addition to the proposed
> > MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
> > work with. Relatively speaking, in what way would we be lacking
> > functionality?
>
> Flexibility is definitely a plus but look at our existing configuration
> space and try to wrap your head around that.
>

:)

> > > I mean a non #PF can consume more resources to allocate
> > > and collapse a THP as it won't inflict any measurable latency to the
> > > targeting process (except for potential CPU contention).
> >
> > Sorry, I'm not sure I understand this. What latency are we discussing
> > in this point? Do you mean to say that since MADV_COLLAPSE isn't in
> > the fault path, it doesn't necessarily need to be fast / direct
> > reclaim wouldn't be noticed?
>
> Exactly. Same as khugepaged. I would even argue that khugepaged and
> madvise would better behave consistently because in both cases it is a
> remote operation to create THPs. One triggered automatically the other
> explicitly requested by the userspace. Having a third mode (for madvise)
> would add more to the configuration space and thus more complexity.
> [...]

Got it. I combined this with the answer at the end.

> > > > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> > > >
> > > > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > > > inter-process protection for collapsing memory in another process'
> > > > address space (which a malevolent program could exploit to cause oom
> > > > conditions in another memcg hierarchy, for example), but we want
> > > > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > > > utilization as they wish.
> > >
> > > Could you expand some more please? How is this any different from
> > > khugepaged (well, except that you can trigger the collapsing explicitly
> > > rather than rely on khugepaged to find that mm)?
> > >
> >
> > MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
> > extend khugepaged in userspace, where the benefit is precisely that we
> > can choose that mm/vma more intelligently.
>
> Could you elaborate some more?
>

One idea from the original RFC was moving khugepaged to userspace[1].
Eventually, uhugepaged could be further augmented/informed with task
prioritization or runtime metrics to optimize THP utilization
system-wide, making the best use of the THPs available on the system
at any given point. This flag was partially motivated by allowing a
first step (1) in which khugepaged is replicated as-is, in userspace.

The other motivation is simply to give users a choice w.r.t. how hard
to try for a THP. Abiding by khugepaged-like semantics was the
default, but an informed user might have good reason to back memory
that is currently 90% swapped out with THPs.

Perhaps for the initial series, we can forgo this flag for simplicity,
assume the user is informed, and ignore pte limits. We can revisit
this as necessary in the future, if the need arises.

> > > > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > > > to explicitly tell the kernel to try harder to back this by thps,
> > > > regardless of the current system/vma configuration.
> > > >
> > > > Note that when used together, these flags can be used to implement the
> > > > exact behavior of khugepaged, through MADV_COLLAPSE.
> > >
> > > IMHO this is stretching the interface and this can backfire in the
> > > future. The interface should be really trivial. I want to collapse a
> > > memory area. Let the kernel do the right thing and do not bother with
> > > all the implementation details. I would use the same allocation strategy
> > > as khugepaged as this seems to be the closest from the latency and
> > > application awareness POV. In a way you can look at the madvise call as
> > > a way to trigger khugepaged functionality on the particular memory range.
> >
> > Trying to summarize a few earlier comments centering around
> > MADV_F_COLLAPSE_DEFRAG and allocation semantics.
> >
> > This series presupposes the existence of an informed userspace agent
> > that is aware of what processes/memory ranges would benefit most from
> > thps. Such an agent might either be:
> > (1) A system-level daemon optimizing thp utilization system-wide
> > (2) A highly tuned process / malloc implementation optimizing their
> > own thp usage
> >
> > The different types of agents reflect the divide between #PF and
> > DEFRAG semantics.
> >
> > For (1), we want to view this exactly like triggering khugepaged
> > functionality from userspace, and likely want DEFRAG semantics.
> >
> > For (2), I was viewing this as the "live" symmetric counterpart to
> > at-fault thp allocation where the process has decided, at runtime,
> > that this memory could benefit from thp backing, and so #PF semantics
> > seemed like a sane default. I'd worry that using DEFRAG semantics by
> > default might deter adoption by users who might not be willing to wait
> > an unbounded amount of time for direct reclaim.
>
> This time is not really unbounded. THP even in the defrag mode doesn't
> try as hard as e.g. hugetlb allocations.
>
> For your 2) category I am not really sure I see the point. Why would
> you want to rely on madvise in a lightweight allocation mode when this
> has already been done at #PF time? If an application really
> knows it wants to use THP then madvise(MADV_HUGEPAGE) would be the first
> thing to do. This would already tell #PF to try a bit harder in some
> configurations and khugepaged knows that collapsing memory makes sense.
>

The primary motivation here is that, at some point long after
fault time, a process may determine it would like the memory backed by
hugepages - but still prefer a lightweight allocation. A system
allocator is the canonical example: it might free memory via
MADV_DONTNEED, but at some later point want to MADV_COLLAPSE the
region again once it becomes heavily used; however, it wouldn't be
willing to tolerate reclaim to do so.

> That being said I would be really careful to provide an extended
> interface to control how hard to try to allocate a THP. This has a high
> risk of externalizing internal implementation details about how the
> compaction works. Unless we have a strong real life usecase I would go
> with the khugepaged semantic initially. Maybe we will learn about future
> usecases where a very lightweight allocation mode is required but that
> can be added later on. The simpler the interface is initially the
> better.
>
Understand and respect your thoughts here. I won't pretend to know
what the best* option is, but presumably having control over when to
allow reclaim was important enough to motivate our current, extensive
configuration space.

Without the option, we have no control from userspace. With it, we
may* have too much. Initially, I'll propose a simple interface that
defaults to whatever is in
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and we can
incrementally expand it if/when necessary.

> Thanks!
> --
> Michal Hocko
> SUSE Labs

Again, thanks for taking the time to read and discuss,

Zach

[1] https://lore.kernel.org/all/5127b9c-a147-8ef5-c942-ae8c755413d0@google.com/