[RFC,00/26] hugetlb: Introduce HugeTLB high-granularity mapping

Message ID 20220624173656.2033256-1-jthoughton@google.com (mailing list archive)

Message

James Houghton June 24, 2022, 5:36 p.m. UTC
This RFC introduces the concept of HugeTLB high-granularity mapping
(HGM)[1].  In broad terms, this series teaches HugeTLB how to map HugeTLB
pages at different granularities, and, more importantly, to partially map
a HugeTLB page.  This cover letter will go over
 - the motivation for these changes
 - userspace API
 - some of the changes to HugeTLB to make this work
 - limitations & future enhancements

High-granularity mapping does *not* involve dissolving the hugepages
themselves; it only affects how they are mapped.

---- Motivation ----

Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
cases in post-copy live migration and memory failure handling.

- Live Migration (userfaultfd)
For post-copy live migration, using userfaultfd, currently we have to
install an entire hugepage before we can allow a guest to access that page.
This is because, right now, either the WHOLE hugepage is mapped or NONE of
it is.  So either the guest can access the WHOLE hugepage or NONE of it.
This makes post-copy live migration for 1G HugeTLB-backed VMs completely
infeasible.

With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
thereby allowing the guest to access only PAGE_SIZE chunks, and getting
page faults on the rest (and triggering another demand-fetch). This gives
userspace the flexibility to install PAGE_SIZE chunks of memory into a
hugepage, making migration of 1G-backed VMs perfectly feasible, and it
vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.

At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
per-page median fetch latencies:
     4K: <100us
     2M: >10ms
Being able to unpause a vCPU 100x quicker is helpful for guest stability,
and being able to use 1G pages at all can significantly improve steady-state
guest performance.

After fully copying a hugepage over the network, we will want to collapse
the mapping down to what it would normally be (e.g., one PUD for a 1G
page). Rather than having the kernel do this automatically, we leave it up
to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
the API that is being introduced for THPs[2]).

- Memory Failure
When a memory error is found within a HugeTLB page, it would be ideal if we
could unmap only the PAGE_SIZE section that contained the error. This is
what THPs are able to do. Using high-granularity mapping, we could do this,
but this isn't tackled in this patch series.

---- Userspace API ----

This patch series introduces a single way to take advantage of
high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
userspace to resolve MINOR page faults on shared VMAs.

To collapse a HugeTLB address range that has been mapped with several
UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
userspace to know when all pages (that they care about) have been fetched.

---- HugeTLB Changes ----

- Mapcount
The way mapcount is handled is different from the way that it was handled
before. If the PUD for a hugepage is not none, a hugepage's mapcount will
be increased. This scheme means that, for hugepages that aren't mapped at
high granularity, their mapcounts will remain the same as what they would
have been pre-HGM.

- Page table walking and manipulation
A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
high-granularity mappings. It may eventually be possible to merge
hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.

We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
This is because we generally need to know the "size" of a PTE (previously
always just huge_page_size(hstate)).

For every page table manipulation function that has a huge version (e.g.
huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
PTE really is "huge".

- Synchronization
For existing bits of HugeTLB, synchronization is unchanged. For splitting
and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
writing, and for doing high-granularity page table walks, we require it to
be held for reading.

---- Limitations & Future Changes ----

This patch series only implements high-granularity mapping for VM_SHARED
VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
failure recovery for both shared and private mappings.

The memory failure use case poses its own challenges that can be
addressed, but I will do so in a separate RFC.

Performance has not been heavily scrutinized with this patch series. There
are places where lock contention can significantly reduce performance. This
will be addressed later.

The patch series, as it stands right now, is compatible with the VMEMMAP
page struct optimization[3], as we do not need to modify data contained
in the subpage page structs.

Other omissions:
 - Compatibility with userfaultfd write-protect (will be included in v1).
 - Support for mremap() (will be included in v1). This looks a lot like
   the support we have for fork().
 - Documentation changes (will be included in v1).
 - Completely ignores PMD sharing and hugepage migration (will be included
   in v1).
 - Implementations for architectures that don't use GENERAL_HUGETLB other
   than arm64.

---- Patch Breakdown ----

Patch 1     - Preliminary changes
Patch 2-10  - HugeTLB HGM core changes
Patch 11-13 - HugeTLB HGM page table walking functionality
Patch 14-19 - HugeTLB HGM compatibility with other bits
Patch 20-23 - Userfaultfd and collapse changes
Patch 24-26 - arm64 support and selftests

[1] This used to be called HugeTLB double mapping, a bad and confusing
    name. "High-granularity mapping" is not a great name either. I am open
    to better names.
[2] https://lore.kernel.org/linux-mm/20220604004004.954674-10-zokeefe@google.com/
[3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

James Houghton (26):
  hugetlb: make hstate accessor functions const
  hugetlb: sort hstates in hugetlb_init_hstates
  hugetlb: add make_huge_pte_with_shift
  hugetlb: make huge_pte_lockptr take an explicit shift argument.
  hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  mm: make free_p?d_range functions public
  hugetlb: add hugetlb_pte to track HugeTLB page table entries
  hugetlb: add hugetlb_free_range to free PT structures
  hugetlb: add hugetlb_hgm_enabled
  hugetlb: add for_each_hgm_shift
  hugetlb: add hugetlb_walk_to to do PT walks
  hugetlb: add HugeTLB splitting functionality
  hugetlb: add huge_pte_alloc_high_granularity
  hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  hugetlb: make unmapping compatible with high-granularity mappings
  hugetlb: make hugetlb_change_protection compatible with HGM
  hugetlb: update follow_hugetlb_page to support HGM
  hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  hugetlb: add HGM support for copy_hugetlb_page_range
  hugetlb: add support for high-granularity UFFDIO_CONTINUE
  hugetlb: add hugetlb_collapse
  madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
  userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  arm64/hugetlb: add support for high-granularity mappings
  selftests: add HugeTLB HGM to userfaultfd selftest
  selftests: add HugeTLB HGM to KVM demand paging selftest

 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/hugetlbpage.c                   |  63 ++
 arch/powerpc/mm/pgtable.c                     |   3 +-
 arch/s390/mm/gmap.c                           |   8 +-
 fs/Kconfig                                    |   7 +
 fs/proc/task_mmu.c                            |  35 +-
 fs/userfaultfd.c                              |  10 +-
 include/asm-generic/tlb.h                     |   6 +-
 include/linux/hugetlb.h                       | 177 +++-
 include/linux/mm.h                            |   7 +
 include/linux/pagewalk.h                      |   3 +-
 include/uapi/asm-generic/mman-common.h        |   2 +
 include/uapi/linux/userfaultfd.h              |   2 +
 mm/damon/vaddr.c                              |  34 +-
 mm/hmm.c                                      |   7 +-
 mm/hugetlb.c                                  | 987 +++++++++++++++---
 mm/madvise.c                                  |  23 +
 mm/memory.c                                   |   8 +-
 mm/mempolicy.c                                |  11 +-
 mm/migrate.c                                  |   3 +-
 mm/mincore.c                                  |   4 +-
 mm/mprotect.c                                 |   6 +-
 mm/page_vma_mapped.c                          |   3 +-
 mm/pagewalk.c                                 |  18 +-
 mm/userfaultfd.c                              |  57 +-
 .../testing/selftests/kvm/include/test_util.h |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   2 +-
 tools/testing/selftests/kvm/lib/test_util.c   |  14 +
 tools/testing/selftests/vm/userfaultfd.c      |  61 +-
 29 files changed, 1314 insertions(+), 250 deletions(-)

Comments

Matthew Wilcox June 24, 2022, 6:29 p.m. UTC | #1
On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> [1] This used to be called HugeTLB double mapping, a bad and confusing
>     name. "High-granularity mapping" is not a great name either. I am open
>     to better names.

Oh good, I was grinding my teeth every time I read it ;-)

How does "Fine granularity" work for you?
"sub-page mapping" might work too.
Mina Almasry June 24, 2022, 6:41 p.m. UTC | #2
On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This RFC introduces the concept of HugeTLB high-granularity mapping
> (HGM)[1].  In broad terms, this series teaches HugeTLB how to map HugeTLB
> pages at different granularities, and, more importantly, to partially map
> a HugeTLB page.  This cover letter will go over
>  - the motivation for these changes
>  - userspace API
>  - some of the changes to HugeTLB to make this work
>  - limitations & future enhancements
>
> High-granularity mapping does *not* involve dissolving the hugepages
> themselves; it only affects how they are mapped.
>
> ---- Motivation ----
>
> Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
> cases in post-copy live migration and memory failure handling.
>
> - Live Migration (userfaultfd)
> For post-copy live migration, using userfaultfd, currently we have to
> install an entire hugepage before we can allow a guest to access that page.
> This is because, right now, either the WHOLE hugepage is mapped or NONE of
> it is.  So either the guest can access the WHOLE hugepage or NONE of it.
> This makes post-copy live migration for 1G HugeTLB-backed VMs completely
> infeasible.
>
> With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
> thereby allowing the guest to access only PAGE_SIZE chunks, and getting
> page faults on the rest (and triggering another demand-fetch). This gives
> userspace the flexibility to install PAGE_SIZE chunks of memory into a
> hugepage, making migration of 1G-backed VMs perfectly feasible, and it
> vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.
>
> At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
> per-page median fetch latencies:
>      4K: <100us
>      2M: >10ms
> Being able to unpause a vCPU 100x quicker is helpful for guest stability,
> and being able to use 1G pages at all can significantly improve steady-state
> guest performance.
>
> After fully copying a hugepage over the network, we will want to collapse
> the mapping down to what it would normally be (e.g., one PUD for a 1G
> page). Rather than having the kernel do this automatically, we leave it up
> to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
> the API that is being introduced for THPs[2]).
>
> - Memory Failure
> When a memory error is found within a HugeTLB page, it would be ideal if we
> could unmap only the PAGE_SIZE section that contained the error. This is
> what THPs are able to do. Using high-granularity mapping, we could do this,
> but this isn't tackled in this patch series.
>
> ---- Userspace API ----
>
> This patch series introduces a single way to take advantage of
> high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> userspace to resolve MINOR page faults on shared VMAs.
>
> To collapse a HugeTLB address range that has been mapped with several
> UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> userspace to know when all pages (that they care about) have been fetched.
>

Thanks James! Cover letter looks good. A few questions:

Why not have the kernel collapse the hugepage once all the 4K pages
have been fetched automatically? It would remove the need for a new
userspace API, and AFAICT there aren't really any cases where it is
beneficial to have a hugepage sharded into 4K mappings when those
mappings can be collapsed.

> ---- HugeTLB Changes ----
>
> - Mapcount
> The way mapcount is handled is different from the way that it was handled
> before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> be increased. This scheme means that, for hugepages that aren't mapped at
> high granularity, their mapcounts will remain the same as what they would
> have been pre-HGM.
>

Sorry, I didn't quite follow this. It says mapcount is handled
differently, but the same if the page is not mapped at high
granularity. Can you elaborate on how the mapcount handling will be
different when the page is mapped at high granularity?

> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
>
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
>
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> PTE really is "huge".
>
> - Synchronization
> For existing bits of HugeTLB, synchronization is unchanged. For splitting
> and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> writing, and for doing high-granularity page table walks, we require it to
> be held for reading.
>
> ---- Limitations & Future Changes ----
>
> This patch series only implements high-granularity mapping for VM_SHARED
> VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> failure recovery for both shared and private mappings.
>
> The memory failure use case poses its own challenges that can be
> addressed, but I will do so in a separate RFC.
>
> Performance has not been heavily scrutinized with this patch series. There
> are places where lock contention can significantly reduce performance. This
> will be addressed later.
>
> The patch series, as it stands right now, is compatible with the VMEMMAP
> page struct optimization[3], as we do not need to modify data contained
> in the subpage page structs.
>
> Other omissions:
>  - Compatibility with userfaultfd write-protect (will be included in v1).
>  - Support for mremap() (will be included in v1). This looks a lot like
>    the support we have for fork().
>  - Documentation changes (will be included in v1).
>  - Completely ignores PMD sharing and hugepage migration (will be included
>    in v1).
>  - Implementations for architectures that don't use GENERAL_HUGETLB other
>    than arm64.
>
> ---- Patch Breakdown ----
>
> Patch 1     - Preliminary changes
> Patch 2-10  - HugeTLB HGM core changes
> Patch 11-13 - HugeTLB HGM page table walking functionality
> Patch 14-19 - HugeTLB HGM compatibility with other bits
> Patch 20-23 - Userfaultfd and collapse changes
> Patch 24-26 - arm64 support and selftests
>
> [1] This used to be called HugeTLB double mapping, a bad and confusing
>     name. "High-granularity mapping" is not a great name either. I am open
>     to better names.

I would drop 1 extra word and do "granular mapping", as in the mapping
is more granular than what it normally is (2MB/1G, etc).

> [2] https://lore.kernel.org/linux-mm/20220604004004.954674-10-zokeefe@google.com/
> [3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> James Houghton (26):
>   hugetlb: make hstate accessor functions const
>   hugetlb: sort hstates in hugetlb_init_hstates
>   hugetlb: add make_huge_pte_with_shift
>   hugetlb: make huge_pte_lockptr take an explicit shift argument.
>   hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
>   mm: make free_p?d_range functions public
>   hugetlb: add hugetlb_pte to track HugeTLB page table entries
>   hugetlb: add hugetlb_free_range to free PT structures
>   hugetlb: add hugetlb_hgm_enabled
>   hugetlb: add for_each_hgm_shift
>   hugetlb: add hugetlb_walk_to to do PT walks
>   hugetlb: add HugeTLB splitting functionality
>   hugetlb: add huge_pte_alloc_high_granularity
>   hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
>   hugetlb: make unmapping compatible with high-granularity mappings
>   hugetlb: make hugetlb_change_protection compatible with HGM
>   hugetlb: update follow_hugetlb_page to support HGM
>   hugetlb: use struct hugetlb_pte for walk_hugetlb_range
>   hugetlb: add HGM support for copy_hugetlb_page_range
>   hugetlb: add support for high-granularity UFFDIO_CONTINUE
>   hugetlb: add hugetlb_collapse
>   madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
>   userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
>   arm64/hugetlb: add support for high-granularity mappings
>   selftests: add HugeTLB HGM to userfaultfd selftest
>   selftests: add HugeTLB HGM to KVM demand paging selftest
>
>  arch/arm64/Kconfig                            |   1 +
>  arch/arm64/mm/hugetlbpage.c                   |  63 ++
>  arch/powerpc/mm/pgtable.c                     |   3 +-
>  arch/s390/mm/gmap.c                           |   8 +-
>  fs/Kconfig                                    |   7 +
>  fs/proc/task_mmu.c                            |  35 +-
>  fs/userfaultfd.c                              |  10 +-
>  include/asm-generic/tlb.h                     |   6 +-
>  include/linux/hugetlb.h                       | 177 +++-
>  include/linux/mm.h                            |   7 +
>  include/linux/pagewalk.h                      |   3 +-
>  include/uapi/asm-generic/mman-common.h        |   2 +
>  include/uapi/linux/userfaultfd.h              |   2 +
>  mm/damon/vaddr.c                              |  34 +-
>  mm/hmm.c                                      |   7 +-
>  mm/hugetlb.c                                  | 987 +++++++++++++++---
>  mm/madvise.c                                  |  23 +
>  mm/memory.c                                   |   8 +-
>  mm/mempolicy.c                                |  11 +-
>  mm/migrate.c                                  |   3 +-
>  mm/mincore.c                                  |   4 +-
>  mm/mprotect.c                                 |   6 +-
>  mm/page_vma_mapped.c                          |   3 +-
>  mm/pagewalk.c                                 |  18 +-
>  mm/userfaultfd.c                              |  57 +-
>  .../testing/selftests/kvm/include/test_util.h |   2 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |   2 +-
>  tools/testing/selftests/kvm/lib/test_util.c   |  14 +
>  tools/testing/selftests/vm/userfaultfd.c      |  61 +-
>  29 files changed, 1314 insertions(+), 250 deletions(-)
>
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>
Matthew Wilcox June 24, 2022, 6:47 p.m. UTC | #3
On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> 
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
> 
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> PTE really is "huge".

I'm disappointed to hear that page table walking is going to become even
more special.  I'd much prefer it if hugetlb walking were exactly the
same as THP walking.  This seems like a good time to do at least some
of that work.

Was there a reason you chose the "more complexity" direction?
James Houghton June 27, 2022, 4:27 p.m. UTC | #4
On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > [trimmed...]
> > ---- Userspace API ----
> >
> > This patch series introduces a single way to take advantage of
> > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > userspace to resolve MINOR page faults on shared VMAs.
> >
> > To collapse a HugeTLB address range that has been mapped with several
> > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > userspace to know when all pages (that they care about) have been fetched.
> >
>
> Thanks James! Cover letter looks good. A few questions:
>
> Why not have the kernel collapse the hugepage once all the 4K pages
> have been fetched automatically? It would remove the need for a new
> userspace API, and AFAICT there aren't really any cases where it is
> beneficial to have a hugepage sharded into 4K mappings when those
> mappings can be collapsed.

The reason that we don't automatically collapse mappings is because it
would take additional complexity, and it is less flexible. Consider
the case of 1G pages on x86: currently, userspace can collapse the
whole page when it's all ready, but they can also choose to collapse a
2M piece of it. On architectures with more supported hugepage sizes
(e.g., arm64), userspace has even more possibilities for when to
collapse. This likely further complicates a potential
automatic-collapse solution. Userspace may also want to collapse the
mapping for an entire hugepage without completely mapping the hugepage
first (this would also be possible by issuing UFFDIO_CONTINUE on all
the holes, though).

>
> > ---- HugeTLB Changes ----
> >
> > - Mapcount
> > The way mapcount is handled is different from the way that it was handled
> > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > be increased. This scheme means that, for hugepages that aren't mapped at
> > high granularity, their mapcounts will remain the same as what they would
> > have been pre-HGM.
> >
>
> Sorry, I didn't quite follow this. It says mapcount is handled
> differently, but the same if the page is not mapped at high
> granularity. Can you elaborate on how the mapcount handling will be
> different when the page is mapped at high granularity?

I guess I didn't phrase this very well. For the sake of simplicity,
consider 1G pages on x86, typically mapped with leaf-level PUDs.
Previously, there were two possibilities for how a hugepage was
mapped, either it was (1) completely mapped (PUD is present and a
leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
case, where the PUD is not none but also not a leaf (this usually
means that the page is partially mapped). We handle this case as if
the whole page was mapped. That is, if we partially map a hugepage
that was previously unmapped (making the PUD point to PMDs), we
increment its mapcount, and if we completely unmap a partially mapped
hugepage (making the PUD none), we decrement its mapcount. If we
collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.

It is possible for a PUD to be present and not a leaf (mapcount has
been incremented) but for the page to still be unmapped: if the PMDs
(or PTEs) underneath are all none. This case is atypical, and as of
this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
think it would be very difficult to get this to happen.

>
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
> >
> > - Synchronization
> > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > writing, and for doing high-granularity page table walks, we require it to
> > be held for reading.
> >
> > ---- Limitations & Future Changes ----
> >
> > This patch series only implements high-granularity mapping for VM_SHARED
> > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > failure recovery for both shared and private mappings.
> >
> > The memory failure use case poses its own challenges that can be
> > addressed, but I will do so in a separate RFC.
> >
> > Performance has not been heavily scrutinized with this patch series. There
> > are places where lock contention can significantly reduce performance. This
> > will be addressed later.
> >
> > The patch series, as it stands right now, is compatible with the VMEMMAP
> > page struct optimization[3], as we do not need to modify data contained
> > in the subpage page structs.
> >
> > Other omissions:
> >  - Compatibility with userfaultfd write-protect (will be included in v1).
> >  - Support for mremap() (will be included in v1). This looks a lot like
> >    the support we have for fork().
> >  - Documentation changes (will be included in v1).
> >  - Completely ignores PMD sharing and hugepage migration (will be included
> >    in v1).
> >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> >    than arm64.
> >
> > ---- Patch Breakdown ----
> >
> > Patch 1     - Preliminary changes
> > Patch 2-10  - HugeTLB HGM core changes
> > Patch 11-13 - HugeTLB HGM page table walking functionality
> > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > Patch 20-23 - Userfaultfd and collapse changes
> > Patch 24-26 - arm64 support and selftests
> >
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> >     name. "High-granularity mapping" is not a great name either. I am open
> >     to better names.
>
> I would drop 1 extra word and do "granular mapping", as in the mapping
> is more granular than what it normally is (2MB/1G, etc).

Noted. :)
James Houghton June 27, 2022, 4:36 p.m. UTC | #5
On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> >     name. "High-granularity mapping" is not a great name either. I am open
> >     to better names.
>
> Oh good, I was grinding my teeth every time I read it ;-)
>
> How does "Fine granularity" work for you?
> "sub-page mapping" might work too.

"Granularity", as I've come to realize, is hard to say, so I think I
prefer sub-page mapping. :) So to recap the suggestions I have so far:

1. Sub-page mapping
2. Granular mapping
3. Flexible mapping

I'll pick one of these (or maybe some other one that works better) for
the next version of this series.
James Houghton June 27, 2022, 4:48 p.m. UTC | #6
On Fri, Jun 24, 2022 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
>
> I'm disappointed to hear that page table walking is going to become even
> more special.  I'd much prefer it if hugetlb walking were exactly the
> same as THP walking.  This seems like a good time to do at least some
> of that work.
>
> Was there a reason you chose the "more complexity" direction?

I chose this direction because it seemed to be the most
straightforward to get to a working prototype and then to an RFC. I
agree with your sentiment -- I'll see what I can do to reconcile THP
walking with HugeTLB(+HGM) walking.
Dr. David Alan Gilbert June 27, 2022, 5:56 p.m. UTC | #7
* James Houghton (jthoughton@google.com) wrote:
> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > Oh good, I was grinding my teeth every time I read it ;-)
> >
> > How does "Fine granularity" work for you?
> > "sub-page mapping" might work too.
> 
> "Granularity", as I've come to realize, is hard to say, so I think I
> prefer sub-page mapping. :) So to recap the suggestions I have so far:
> 
> 1. Sub-page mapping
> 2. Granular mapping
> 3. Flexible mapping
> 
> I'll pick one of these (or maybe some other one that works better) for
> the next version of this series.

<shrug> Just a name; SPM might work (although may confuse those
architectures which had subprotection for normal pages), and at least
we can mispronounce it.

In 14/26 your commit message says:

  1. Faults can be passed to handle_userfault. (Userspace will want to
     use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
     region they should call UFFDIO_CONTINUE on later.)

can you explain what that new UFFD_FEATURE does?

Dave
James Houghton June 27, 2022, 8:31 p.m. UTC | #8
On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * James Houghton (jthoughton@google.com) wrote:
> > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > Oh good, I was grinding my teeth every time I read it ;-)
> > >
> > > How does "Fine granularity" work for you?
> > > "sub-page mapping" might work too.
> >
> > "Granularity", as I've come to realize, is hard to say, so I think I
> > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> >
> > 1. Sub-page mapping
> > 2. Granular mapping
> > 3. Flexible mapping
> >
> > I'll pick one of these (or maybe some other one that works better) for
> > the next version of this series.
>
> <shrug> Just a name; SPM might work (although may confuse those
> architectures which had subprotection for normal pages), and at least
> we can mispronounce it.
>
> In 14/26 your commit message says:
>
>   1. Faults can be passed to handle_userfault. (Userspace will want to
>      use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>      region they should be call UFFDIO_CONTINUE on later.)
>
> can you explain what that new UFFD_FEATURE does?

+cc Nadav Amit <namit@vmware.com> to check me here.

Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
feature, and it actually isn't needed (I will correct the commit
message). Why it isn't needed is a little bit complicated, though. Let
me explain:

Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
userfaultfd gave userspace for HugeTLB pages was rounded down to be
hstate-size-aligned. This would have had to change, because userspace,
to take advantage of HGM, needs to know which 4K piece to install.

However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
address was rounded down to be PAGE_SIZE-aligned instead, even if the
flag wasn't used. I think this was an unintended change. If the flag
is used, then the address isn't rounded at all -- that was the
intended purpose of this flag. Hope that makes sense.
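
To make the three behaviours concrete, here is a toy sketch of the
rounding in plain C (illustrative only, not kernel code; the 1G hstate
is just an example):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096UL
#define HPAGE_SIZE  (1UL << 30)  /* example: 1G hstate */

/* Old HugeTLB behavior: address rounded down to the hstate boundary. */
static uint64_t round_hstate(uint64_t addr)
{
	return addr & ~(HPAGE_SIZE - 1);
}

/* After commit 824ddc601adc, without the flag: PAGE_SIZE-aligned. */
static uint64_t round_page(uint64_t addr)
{
	return addr & ~(PAGE_SIZE - 1);
}

/* With UFFD_FEATURE_EXACT_ADDRESS set: no rounding at all. */
static uint64_t exact_addr(uint64_t addr)
{
	return addr;
}
```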

The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
userspace that high-granularity CONTINUEs are available.

[1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")


>
> Dave
>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Nadav Amit June 28, 2022, 12:04 a.m. UTC | #9
> On Jun 27, 2022, at 1:31 PM, James Houghton <jthoughton@google.com> wrote:
> 
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
>> 
>> * James Houghton (jthoughton@google.com) wrote:
>>> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>> 
>>>> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
>>>>> [1] This used to be called HugeTLB double mapping, a bad and confusing
>>>>> name. "High-granularity mapping" is not a great name either. I am open
>>>>> to better names.
>>>> 
>>>> Oh good, I was grinding my teeth every time I read it ;-)
>>>> 
>>>> How does "Fine granularity" work for you?
>>>> "sub-page mapping" might work too.
>>> 
>>> "Granularity", as I've come to realize, is hard to say, so I think I
>>> prefer sub-page mapping. :) So to recap the suggestions I have so far:
>>> 
>>> 1. Sub-page mapping
>>> 2. Granular mapping
>>> 3. Flexible mapping
>>> 
>>> I'll pick one of these (or maybe some other one that works better) for
>>> the next version of this series.
>> 
>> <shrug> Just a name; SPM might work (although may confuse those
>> architectures which had subprotection for normal pages), and at least
>> we can mispronounce it.
>> 
>> In 14/26 your commit message says:
>> 
>> 1. Faults can be passed to handle_userfault. (Userspace will want to
>> use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>> region they should be call UFFDIO_CONTINUE on later.)
>> 
>> can you explain what that new UFFD_FEATURE does?
> 
> +cc Nadav Amit <namit@vmware.com> to check me here.
> 
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
> 
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
> 
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.
> 
> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
> 
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")

Indeed, this change of behavior (not aligning to huge pages when the
flag is not set) was unintentional. If you want to fix it in a
separate patch so it can be backported, that may be a good idea.

For the record, there was a short period of time in 2016 when the exact
fault address was delivered even when UFFD_FEATURE_EXACT_ADDRESS was not
provided. We had some arguments whether this was a regression...

BTW: I should have thought of the use-case of knowing the exact
address within huge pages. It would have shortened my discussions with
Andrea on whether this feature (UFFD_FEATURE_EXACT_ADDRESS) is
needed. :)
Dr. David Alan Gilbert June 28, 2022, 8:20 a.m. UTC | #10
* James Houghton (jthoughton@google.com) wrote:
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * James Houghton (jthoughton@google.com) wrote:
> > > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > > >     to better names.
> > > >
> > > > Oh good, I was grinding my teeth every time I read it ;-)
> > > >
> > > > How does "Fine granularity" work for you?
> > > > "sub-page mapping" might work too.
> > >
> > > "Granularity", as I've come to realize, is hard to say, so I think I
> > > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> > >
> > > 1. Sub-page mapping
> > > 2. Granular mapping
> > > 3. Flexible mapping
> > >
> > > I'll pick one of these (or maybe some other one that works better) for
> > > the next version of this series.
> >
> > <shrug> Just a name; SPM might work (although may confuse those
> > architectures which had subprotection for normal pages), and at least
> > we can mispronounce it.
> >
> > In 14/26 your commit message says:
> >
> >   1. Faults can be passed to handle_userfault. (Userspace will want to
> >      use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
> >      region they should be call UFFDIO_CONTINUE on later.)
> >
> > can you explain what that new UFFD_FEATURE does?
> 
> +cc Nadav Amit <namit@vmware.com> to check me here.
> 
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
> 
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
> 
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.

Oh that's 'fun'; right, but the need for the less-rounded address
makes sense.

One other thing I thought of: you provide the modified 'CONTINUE'
behaviour, which works for postcopy as long as you use two mappings in
userspace: one protected by userfaultfd, and one which you do the
writes to, and then issue the CONTINUE into the protected mapping.
That's fine, but it's not currently how we have our postcopy code
wired up in qemu; we have one mapping and use UFFDIO_COPY to place the
page.  Requiring the two mappings is fine, but it's probably worth
pointing out the need for it somewhere.
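
As a rough sketch of that two-mapping scheme (all names here are
illustrative, not taken from qemu or the patch series): the same
hugetlbfs file is mapped twice, and the VMM translates each fault
address reported in the uffd-registered mapping into the writable
alias before copying the page contents in:

```c
#include <assert.h>
#include <stdint.h>

/* Two views of the same hugetlbfs file: faults are trapped on
 * guest_base (registered with userfaultfd); page contents are written
 * through alias_base, after which UFFDIO_CONTINUE is issued on the
 * faulting range in the registered mapping. */
struct postcopy_mappings {
	uintptr_t guest_base;  /* userfaultfd-registered mapping */
	uintptr_t alias_base;  /* plain writable alias */
};

/* Where in the alias to memcpy() the fetched 4K page. */
static uintptr_t alias_addr(const struct postcopy_mappings *m,
			    uintptr_t fault_addr)
{
	return m->alias_base + (fault_addr - m->guest_base);
}
```

After the memcpy() into alias_addr(), userspace would issue
UFFDIO_CONTINUE on the PAGE_SIZE range containing fault_addr in the
registered mapping.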

Dave

> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
> 
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
> 
> 
> >
> > Dave
> >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
Muchun Song June 28, 2022, 2:17 p.m. UTC | #11
On Mon, Jun 27, 2022 at 09:27:38AM -0700, James Houghton wrote:
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
> 
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
> 
> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled

+1

> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
> 
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> 
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
> 

That is a good explanation. I think it would be better to include it
in the cover letter.

Thanks.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > >  - Support for mremap() (will be included in v1). This looks a lot like
> > >    the support we have for fork().
> > >  - Documentation changes (will be included in v1).
> > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > >    in v1).
> > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > >    than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1     - Preliminary changes
> > > Patch 2-10  - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
> 
> Noted. :)
>
Mina Almasry June 28, 2022, 5:26 p.m. UTC | #12
On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
>
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
>

To be honest, I don't think I'm a fan of this. I don't think this
saves complexity, but rather pushes it to userspace: userspace now
must track which regions are faulted in and which are not, so that it
can call MADV_COLLAPSE at the right time. Also, if userspace gets it
wrong, it may accidentally never call MADV_COLLAPSE (and not get any
hugepages), or call MADV_COLLAPSE too early and have to deal with a
storm of maybe hundreds of minor faults at once, which may take too
long to resolve and may impact guest stability, yes?

For these reasons I think automatic collapsing is something that will
eventually be implemented by us or someone else, and at that point
MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
is adding a userspace API that will probably need to be maintained in
perpetuity but is actually likely to go obsolete "soon".
For this reason I had hoped that automatic collapsing would come with
V1.

I wonder if we can have a very simple first try at automatic
collapsing for V1? I.e., can we support collapsing to the hstate size
and only that? So 4K pages can only be either collapsed to 2MB or 1G
on x86 depending on the hstate size. I think this may be not too
difficult to implement: we can have a counter similar to mapcount that
tracks how many of the subpages are mapped (subpage_mapcount). Once
all the subpages are mapped (the counter reaches a certain value),
trigger collapsing similar to hstate size MADV_COLLAPSE.

I gather that no one else reviewing this has raised this issue thus
far so it might not be a big deal and I will continue to review the
RFC, but I had hoped for automatic collapsing myself for the reasons
above.
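
To make the idea concrete, the counter could look something like this
toy model (illustrative only, not code from the series):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-hugepage counter of mapped 4K subpages; when it
 * reaches the number of subpages, the hugepage is fully mapped and
 * could be collapsed to the hstate size automatically. */
#define PAGES_PER_HUGE_PAGE 512		/* 2M hstate on x86 */

struct hgm_page {
	int subpage_mapcount;
};

/* Returns true at the moment the hugepage becomes fully mapped. */
static bool map_subpage(struct hgm_page *p)
{
	return ++p->subpage_mapcount == PAGES_PER_HUGE_PAGE;
}
```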

> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled
> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
>
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
>
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
>

Thank you for the detailed explanation. Please add it to the cover letter.

I wonder about the case "PUD present but all the PMDs are none": is
that a bug? I don't understand the usefulness of it. Not a comment on
this patch, but rather a curiosity.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > >  - Support for mremap() (will be included in v1). This looks a lot like
> > >    the support we have for fork().
> > >  - Documentation changes (will be included in v1).
> > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > >    in v1).
> > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > >    than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1     - Preliminary changes
> > > Patch 2-10  - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
>
> Noted. :)
Dr. David Alan Gilbert June 28, 2022, 5:56 p.m. UTC | #13
* Mina Almasry (almasrymina@google.com) wrote:
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFACT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
> 
> To be honest I'm don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I think it depends on whether userspace is already holding bitmaps
and data structures to let it know when the right time to call
collapse is; if it already has to do all that bookkeeping for its own
postcopy or whatever process, then getting userspace to call it is
easy.  (I don't know the answer to whether it does!)
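
For what it's worth, the bookkeeping in question could be as simple as
a per-hugepage bitmap of received 4K pages (an illustrative sketch,
not qemu code), with MADV_COLLAPSE issued once the last bit is set:

```c
#include <assert.h>
#include <stdbool.h>

#define SUBPAGES 512			/* 4K pages per 2M hugepage */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct hugepage_bitmap {
	unsigned long bits[SUBPAGES / (8 * sizeof(unsigned long))];
	int nr_received;
};

/* Record one received subpage; returns true once all 512 are present,
 * i.e. the point at which MADV_COLLAPSE could be issued. */
static bool record_subpage(struct hugepage_bitmap *b, unsigned int idx)
{
	unsigned long mask = 1UL << (idx % BITS_PER_LONG);
	unsigned long *word = &b->bits[idx / BITS_PER_LONG];

	if (!(*word & mask)) {		/* ignore duplicate fetches */
		*word |= mask;
		b->nr_received++;
	}
	return b->nr_received == SUBPAGES;
}
```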

Dave

> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
> 
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.
> 
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
> 
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
> 
> Thank you for the detailed explanation. Please add it to the cover letter.
> 
> I wonder the case "PUD present but all the PMD are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
> 
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > >    the support we have for fork().
> > > >  - Documentation changes (will be included in v1).
> > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > >    in v1).
> > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > >    than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1     - Preliminary changes
> > > > Patch 2-10  - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)
>
James Houghton June 29, 2022, 6:31 p.m. UTC | #14
On Tue, Jun 28, 2022 at 10:56 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Mina Almasry (almasrymina@google.com) wrote:
> > On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > > >
> > > > > [trimmed...]
> > > > > ---- Userspace API ----
> > > > >
> > > > > This patch series introduces a single way to take advantage of
> > > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > > userspace to resolve MINOR page faults on shared VMAs.
> > > > >
> > > > > To collapse a HugeTLB address range that has been mapped with several
> > > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > > userspace to know when all pages (that they care about) have been fetched.
> > > > >
> > > >
> > > > Thanks James! Cover letter looks good. A few questions:
> > > >
> > > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > > have been fetched automatically? It would remove the need for a new
> > > > userspace API, and AFAICT there aren't really any cases where it is
> > > > beneficial to have a hugepage sharded into 4K mappings when those
> > > > mappings can be collapsed.
> > >
> > > The reason that we don't automatically collapse mappings is because it
> > > would take additional complexity, and it is less flexible. Consider
> > > the case of 1G pages on x86: currently, userspace can collapse the
> > > whole page when it's all ready, but they can also choose to collapse a
> > > 2M piece of it. On architectures with more supported hugepage sizes
> > > (e.g., arm64), userspace has even more possibilities for when to
> > > collapse. This likely further complicates a potential
> > > automatic-collapse solution. Userspace may also want to collapse the
> > > mapping for an entire hugepage without completely mapping the hugepage
> > > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > > the holes, though).
> > >
> >
> > To be honest I don't think I'm a fan of this. I don't think this
> > saves complexity, but rather pushes it to the userspace. I.e. the
> > userspace now must track which regions are faulted in and which are
> > not to call MADV_COLLAPSE at the right time. Also, if the userspace
> > gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> > any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> > storm of maybe hundreds of minor faults at once which may take too
> > long to resolve and may impact guest stability, yes?
>
> I think it depends on whether the userspace is already holding bitmaps
> and data structures to let it know when the right time to call collapse
> is; if it already has to do all that book keeping for its own postcopy
> or whatever process, then getting userspace to call it is easy.
> (I don't know whether it does!)

Userspace generally has a lot of information about which pages have
been UFFDIO_CONTINUE'd, but they may not have the information (say,
some atomic count per hpage) to tell them exactly when to collapse.

I think it's worth discussing the tmpfs/THP case right now, too. Right
now, after userfaultfd post-copy, all the THPs we have will be
PTE-mapped. To deal with this, we need to use Zach's MADV_COLLAPSE to
collapse the mappings to PMD mappings (we don't want to wait for
khugepaged to happen upon them -- we want good performance ASAP :)).
In fact, IIUC, khugepaged actually won't collapse these *ever* right
now. I suppose we could enlighten tmpfs's UFFDIO_CONTINUE to
automatically collapse too (thus avoiding the need for MADV_COLLAPSE),
but that could be complicated/unwanted (if that is something we might
want, maybe we should have a separate discussion).

So, as it stands today, we intend to use MADV_COLLAPSE explicitly in
the tmpfs case as soon as it is supported, and so it follows that it's
ok to require userspace to do the same thing for HugeTLBFS-backed
memory.

>
> Dave
>
> > For these reasons I think automatic collapsing is something that will
> > eventually be implemented by us or someone else, and at that point
> > MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> > is adding a userspace API that will probably need to be maintained for
> > perpetuity but actually is likely going to go obsolete "soon".
> > For this reason I had hoped that automatic collapsing would come with
> > V1.

Small, unimportant clarification: the API, as described here, won't be
*completely* meaningless if we end up implementing automatic
collapsing :) It still has the effect of not requiring other
UFFDIO_CONTINUE operations to be done for the collapsed region.

> >
> > I wonder if we can have a very simple first try at automatic
> > collapsing for V1? I.e., can we support collapsing to the hstate size
> > and only that? So 4K pages can only be either collapsed to 2MB or 1G
> > on x86 depending on the hstate size. I think this may be not too
> > difficult to implement: we can have a counter similar to mapcount that
> > tracks how many of the subpages are mapped (subpage_mapcount). Once
> > all the subpages are mapped (the counter reaches a certain value),
> > trigger collapsing similar to hstate size MADV_COLLAPSE.
> >

In my estimation, to implement automatic collapsing for one VMA, we
would need a per-hstate count: when the count reaches its maximum, we
automatically collapse to the next most optimal size. So if
we finish filling in enough PTEs for a CONT_PTE, we will collapse to a
CONT_PTE. If we finish filling up CONT_PTEs to a PMD, then collapse to
a PMD.

If you are suggesting to only collapse to the hstate size at the end,
then we lose flexibility.

> > I gather that no one else reviewing this has raised this issue thus
> > far so it might not be a big deal and I will continue to review the
> > RFC, but I had hoped for automatic collapsing myself for the reasons
> > above.

Thanks for the thorough review, Mina. :)

> >
> > > >
> > > > > ---- HugeTLB Changes ----
> > > > >
> > > > > - Mapcount
> > > > > The way mapcount is handled is different from the way that it was handled
> > > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > > high granularity, their mapcounts will remain the same as what they would
> > > > > have been pre-HGM.
> > > > >
> > > >
> > > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > > differently, but the same if the page is not mapped at high
> > > > granularity. Can you elaborate on how the mapcount handling will be
> > > > different when the page is mapped at high granularity?
> > >
> > > I guess I didn't phrase this very well. For the sake of simplicity,
> > > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > > Previously, there were two possibilities for how a hugepage was
> > > mapped, either it was (1) completely mapped (PUD is present and a
> > > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > > case, where the PUD is not none but also not a leaf (this usually
> > > means that the page is partially mapped). We handle this case as if
> > > the whole page was mapped. That is, if we partially map a hugepage
> > > that was previously unmapped (making the PUD point to PMDs), we
> > > increment its mapcount, and if we completely unmap a partially mapped
> > > hugepage (making the PUD none), we decrement its mapcount. If we
> > > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> > >
> > > It is possible for a PUD to be present and not a leaf (mapcount has
> > > been incremented) but for the page to still be unmapped: if the PMDs
> > > (or PTEs) underneath are all none. This case is atypical, and as of
> > > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > > think it would be very difficult to get this to happen.
> > >
> > [trimmed...]
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Axel Rasmussen June 29, 2022, 8:39 p.m. UTC | #15
On Tue, Jun 28, 2022 at 10:27 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFAICT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
>
> To be honest I don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I disagree, I think this is state userspace needs to maintain anyway,
even if we ignore the use case James' series is about.

One example: today, you can't UFFDIO_CONTINUE a region which is
already mapped - you'll get -EEXIST. So, userspace needs to be sure
not to double-continue an area. We could think about relaxing this,
but there's a tradeoff - being more permissive means it's "easier to
use", but, it also means we're less strict about catching potentially
buggy userspaces.

There's another case that I don't see any way to get rid of. The way
live migration at least for GCE works is, we have two things
installing new pages: the on-demand fetcher, which reacts to UFFD
events and resolves them. And then we have the background fetcher,
which goes along and fetches pages which haven't been touched /
requested yet (and which may never be, it's not uncommon for a guest
to have at least *some* pages which are very infrequently / never
touched). In order for the background fetcher to know what pages to
transfer over the network, or not, userspace has to remember which
ones it's already installed.

Another point is, consider the use case of UFFDIO_CONTINUE over
UFFDIO_COPY. When userspace gets a UFFD event for a page, the
assumption is that it's somewhat likely the page is already up to
date, because we already copied it over from the source machine before
we stopped the guest and restarted it running on the target machine
("precopy"). So, we want to maintain a dirty bitmap, which tells us
which pages are clean or not - when we get a UFFD event, we check the
bitmap, and only if the page is dirty do we actually go fetch it over
the network - otherwise we just UFFDIO_CONTINUE and we're done.

>
> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to go obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
>
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.

I'm not sure I agree this is likely.

Two problems:

One is, say you UFFDIO_CONTINUE a 4k PTE. If we wanted collapsing to
happen automatically, we'd need to answer the question: is this the
last 4k PTE in a 2M region, so now it can be collapsed? Today the only
way to know is to go check - walk the PTEs. This is expensive, and
it's something we'd have to do on each and every UFFDIO_CONTINUE
operation -- this sucks because we incur the cost on every operation,
even though for most of them (511 out of 512, say) the answer will be
"no, it wasn't the last one; we can't collapse yet". For
on-demand paging, it's really critical installing the page is as fast
as possible -- in an ideal world it would be exactly as fast as a
"normal" minor fault and the guest would not even be able to tell at
all that it was in the process of being migrated.

Now, as you pointed out, we can just store a mapcount somewhere which
keeps track of how many PTEs in each 2M region are installed or not.
So, then we can more quickly check in UFFDIO_CONTINUE. But, we have
the memory overhead and CPU time overhead of maintaining this
metadata. And, it's not like having the kernel do this means userspace
doesn't have to - like I described above, I think userspace would
*also* need to keep track of this same thing anyway, so now we're
doing it 2x.

Another problem I see is, it seems like collapsing automatically would
involve letting UFFD know a bit too much for my liking about hugetlbfs
internals. It seems to me more ideal to have it know as little as
possible about how hugetlbfs works internally.
Also, there are some benefits to letting userspace decide when / if to
collapse.

For example, userspace might decide it prefers to MADV_COLLAPSE
immediately, in the demand paging thread. Or, it might decide it's
okay to let it be collapsed a bit later, and leave that up to some
other background thread. It might MADV_COLLAPSE as soon as it sees a
complete 2M region, or maybe it wants to batch things up and waits
until it has a full 1G region to collapse. It might also do different
things for different regions, e.g. depending on if they were hot or
cold (demand paged vs. background fetched). I don't see any single
"right way" to do things here, I just see tradeoffs, which userspace
is in a good position to decide on.

> [trimmed...]
Peter Xu June 30, 2022, 4:09 p.m. UTC | #16
On Tue, Jun 28, 2022 at 09:20:41AM +0100, Dr. David Alan Gilbert wrote:
> One other thing I thought of; you provide the modified 'CONTINUE'
> behaviour, which works for postcopy as long as you use two mappings in
> userspace; one protected by userfault, and one which you do the writes
> to, and then issue the CONTINUE into the protected mapping; that's fine,
> but it's not currently how we have our postcopy code wired up in qemu,
> we have one mapping and use UFFDIO_COPY to place the page.
> Requiring the two mappings is fine, but it's probably worth pointing out
> the need for it somewhere.

It'll be about CONTINUE, maybe not directly related to sub-page mapping,
but indeed that's something we may need to do.  It's also in my PoC [1]
(I never got time to get back to it yet, though).

It's just that two mappings are not required.  E.g., one could use a fd on
the file and lseek()/write() to the file to update content rather than
using another mapping.  It might be just slower.

Or, IMHO an app can legitimately just delay faulting of some mapping using
minor mode, and maybe the app doesn't even need to modify the page content
before CONTINUE; in that case neither the other mapping nor the fd is
needed.  Fundamentally, MINOR mode and CONTINUE provide another way to
trap page faults when a page cache page already exists.  They don't really
define whether or how the data will be modified.

It's just that for QEMU unfortunately we may need to have that two mappings
just for this use case indeed..

[1] https://github.com/xzpeter/qemu/commit/41538a9a8ff5c981af879afe48e4ecca9a1aabc8

Thanks,
Peter Xu June 30, 2022, 7:21 p.m. UTC | #17
On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
> > [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
> 
> Indeed this change of behavior (not aligning to huge-pages when flags is
> not set) was unintentional. If you want to fix it in a separate patch so
> it would be backported, that may be a good idea.

The fix seems to be straightforward, though.  Nadav, wanna post a patch
yourself?

That seems to be an accident, and having sub-page mapping rely on the
accident is probably not desirable.  So regardless of the separate patch,
I'd suggest we keep the requirement of enabling the exact-addr feature
for sub-page mapping.

Thanks,
Nadav Amit July 1, 2022, 5:54 a.m. UTC | #18
On Jun 30, 2022, at 12:21 PM, Peter Xu <peterx@redhat.com> wrote:

> 
> On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
>>> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
>> 
>> Indeed this change of behavior (not aligning to huge-pages when flags is
>> not set) was unintentional. If you want to fix it in a separate patch so
>> it would be backported, that may be a good idea.
> 
> The fix seems to be straightforward, though.  Nadav, wanna post a patch
> yourself?

Yes, even I can do it :)

Just busy right now, so I’ll try to do it over the weekend.