
[v3,00/12] mm/gup: Unify hugetlb, part 2

Message ID 20240321220802.679544-1-peterx@redhat.com

Message

Peter Xu March 21, 2024, 10:07 p.m. UTC
From: Peter Xu <peterx@redhat.com>

v3:
- Rebased to latest mm-unstable (a824831a082f, of March 21st)
- Dropped the patch introducing pmd_thp_or_huge(); replaced such uses (and
  also pXd_huge() users) with pXd_leaf() [Jason]
- Added a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
- Used IS_ENABLED() in follow_huge_pud() [Jason]
- Removed a redundant none-pud check in follow_pud_mask() [Jason]

rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com

This series removes the hugetlb slow gup path, following previous
refactoring work [1], so that slow gup now uses the exact same path to
process all kinds of memory, including hugetlb.

In the long term, we may want to remove most, if not all, call sites of
huge_pte_offset().  Ideally, that API could be dropped from the arch
hugetlb API entirely.  This series is one small step towards merging
hugetlb-specific code into the generic mm paths.  From that POV, this
series removes one huge_pte_offset() reference out of many.

One goal of this route is that we can reconsider merging hugetlb features
like High Granularity Mapping (HGM).  It was not accepted in the past
because it would add lots of hugetlb-specific code and make the mm code
even harder to maintain.  With a merged codebase, features like HGM can
hopefully share some code with THP, whether legacy (PMD+) or modern
(contiguous PTEs).

To make this work, the generic slow gup code will need to at least
understand hugepd, which fast-gup already does.  Since hugepd is a
software-only solution (no hardware recognizes the hugepd format, so the
structures are purely artificial), there is a chance we can merge some or
all hugepd formats with cont_pte in the future.  That question is still
unsettled, pending an acknowledgement from the Power side.  As of now for
this series, I kept the hugepd handling because we may still need it
before getting a clearer picture of the future of hugepd.  The other
reason is simply that we already did it for fast-gup, and most of that
code is still around to be reused.  It makes more sense to keep slow/fast
gup behaving the same until a decision is made to remove hugepd.
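
For reference, fast-gup's existing hugepd handling is roughly the below (a
condensed copy of gup_huge_pd() from mm/gup.c, renamed here and with error
handling trimmed for illustration); the slow gup side reuses the same idea:

    /*
     * Condensed from mm/gup.c:gup_huge_pd().  A hugepd entry, at
     * whatever pgtable level it appears, points to a table whose
     * entries are all pte_t, each covering 'sz' bytes of VA.
     */
    static int walk_hugepd(hugepd_t hugepd, unsigned long addr,
                           unsigned int pdshift, unsigned long end,
                           unsigned int flags, struct page **pages, int *nr)
    {
            unsigned long sz = 1UL << hugepd_shift(hugepd);
            unsigned long next;
            pte_t *ptep;

            ptep = hugepte_offset(hugepd, addr, pdshift);
            do {
                    next = hugepte_addr_end(addr, end, sz);
                    if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr))
                            return 0;
            } while (ptep++, addr = next, addr != end);

            return 1;
    }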

There is one major difference in slow-gup's cont_pte / cont_pmd handling,
currently supported on three architectures (aarch64, riscv, ppc).  Before
this series, slow gup could recognize e.g. cont_pte entries with the help
of huge_pte_offset() when an hstate was around.  That is now gone, but
things still work by looking up the pgtable entries one by one.

It's not ideal, but hopefully this change should not yet affect major
workloads.  There is some more information in the commit message of the
last patch.  If this becomes a concern, we can consider teaching slow gup
to recognize cont pte/pmd entries, and that should recover the lost
performance.  But I doubt the necessity for now, so I kept it as simple as
it can be.
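
If we ever teach slow gup that, a minimal sketch of what "recognizing"
could mean - assuming arm64's pte_cont() and CONT_PTES, purely
illustrative and not part of this series:

    /*
     * Hypothetical helper (arm64-flavoured): report how many ptes
     * form one contiguous mapping, so a caller could pin a whole
     * cont-pte range per pgtable lookup instead of entry by entry.
     */
    static unsigned int pte_batch_size(pte_t pte)
    {
            if (pte_cont(pte))
                    return CONT_PTES;       /* e.g. 16 x 4K = 64K */
            return 1;                       /* normal small pte */
    }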

Test Done
=========

For x86_64, I tested the full gup_test matrix over 2MB huge pages.  For
aarch64, I tested the same over 64KB cont_pte huge pages.

One note is that this v3 didn't go through any ppc test, as finding such a
system always takes time.  This relies on the fact that previous versions
were tested there, and this version should have zero changes in the hugepd
sections.

If anyone (Christophe?) wants to give it a shot on PowerPC, please do, and
I would appreciate it: "./run_vmtests.sh -a -t gup_test" should do well
enough (please apply [2] first if the hugepd size is <1MB), as long as
we're sure the hugepd pages are touched as expected.

Patch layout
============

Patch 1-8:    Preparation works, or cleanups in relevant code paths
Patch 9-11:   Teach slow gup with all kinds of huge entries (pXd, hugepd)
Patch 12:     Drop hugetlb_follow_page_mask()

More information can be found in the commit messages of each patch.  Any
comments are welcome.  Thanks.

[1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
[2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com

Peter Xu (12):
  mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
  mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
  mm: Make HPAGE_PXD_* macros even if !THP
  mm: Introduce vma_pgtable_walk_{begin|end}()
  mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  mm/gup: Refactor record_subpages() to find 1st small page
  mm/gup: Handle hugetlb for no_page_table()
  mm/gup: Cache *pudp in follow_pud_mask()
  mm/gup: Handle huge pud for follow_pud_mask()
  mm/gup: Handle huge pmd for follow_pmd_mask()
  mm/gup: Handle hugepd for follow_page()
  mm/gup: Handle hugetlb in the generic follow_page_mask code

 include/linux/huge_mm.h |  25 +--
 include/linux/hugetlb.h |  16 +-
 include/linux/mm.h      |   3 +
 mm/Kconfig              |   6 +
 mm/gup.c                | 354 +++++++++++++++++++++++++++++++++-------
 mm/huge_memory.c        | 133 +--------------
 mm/hugetlb.c            |  75 +--------
 mm/internal.h           |   7 +-
 mm/memory.c             |  12 ++
 9 files changed, 337 insertions(+), 294 deletions(-)

Comments

Jason Gunthorpe March 22, 2024, 12:28 p.m. UTC | #1
On Thu, Mar 21, 2024 at 06:07:55PM -0400, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> The hugepd format for GUP is only used on PowerPC with hugetlbfs.  There
> are some kernel usages of hugepd (see hugepd_populate_kernel() for
> PPC_8XX); however, those pages are not candidates for GUP.
> 
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's a
> potential risk of violating GUP over writeback file systems.  That should
> never apply to hugepd.  Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file-typed
> memories that are prone to the same issue.
> 
> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also because it paves the way for reusing the function
> outside fast-gup.
> 
> To make sure we still remember this issue in case hugepd is ever extended
> to support non-hugetlbfs memories, add a rich comment above gup_huge_pd()
> explaining the issue, with proper references.
> 
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Jason Gunthorpe March 22, 2024, 4:10 p.m. UTC | #2
On Thu, Mar 21, 2024 at 06:07:50PM -0400, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> v3:
> - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
>   pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
> 
> rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
> v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
> v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
> 
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
> 
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset().  It'll be ideal if that API can be completely dropped
> from arch hugetlb API.  This series is one small step towards merging
> hugetlb specific codes into generic mm paths.  From that POV, this series
> removes one reference to huge_pte_offset() out of many others.

This remark would be a little easier to understand if you referred to
hugetlb_walk() rather than huge_pte_offset() - the latter is only used to
implement hugetlb_walk() and isn't removed by this series, while a single
hugetlb_walk() call was removed.
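
For context, hugetlb_walk() is only a thin, lock-asserting wrapper over
that arch API - roughly, from include/linux/hugetlb.h:

    /* include/linux/hugetlb.h, roughly: */
    static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned long sz)
    {
            struct mm_struct *mm = vma->vm_mm;

            hugetlb_vma_assert_locked(vma);
            return huge_pte_offset(mm, addr, sz);
    }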

Regardless, I think the point is spot on: the direction should be to make
the page table reading generic, with minimal/no interaction with hugetlbfs
in the generic code.

After this series I would suggest doing the pagewalk.c stuff as it is
very parallel to GUP slow (indeed it would be amazing to figure out a
way to make GUP slow and pagewalk.c use the same code without a
performance cost)

Some of the other core mm callers don't look too bad either; getting to
the point where hugetlb_walk() is internal to hugetlb.c would be a nice
step of reasonable size.

> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM).  It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain.  With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).

Yeah, if all the special hugetlb stuff is using generic arch code and
generic page walkers (maybe with that special vma locking hook) it is
much easier to swallow than adding yet another special class of code
to all the page walkers.

> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup.  Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future.  That question is
> yet unsettled from Power side to have an acknowledgement.  

At a minimum, I think we have a concurrence that the reading of the
hugepd entries should be fine through the standard contig pte APIs and
so all the page walkers doing read side operations could stop having
special hugepd code. It is just an implementation of contig pte with
the new property that the size of a PTE can be larger than a PMD
entry.

If, or how much, the hugepd write side remains special is the open
question, I think.

> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd.  The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused.  It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.

Yeah, I think that is right for this series.  Let's get this done and
then try to remove the hugepd read side.

Thanks,
Jason
Christophe Leroy March 25, 2024, 2:56 p.m. UTC | #3
On 21/03/2024 at 23:07, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> v3:
> - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
>    pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
> 
> rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
> v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
> v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
> 
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
> 
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset().  It'll be ideal if that API can be completely dropped
> from arch hugetlb API.  This series is one small step towards merging
> hugetlb specific codes into generic mm paths.  From that POV, this series
> removes one reference to huge_pte_offset() out of many others.
> 
> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM).  It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain.  With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).
> 
> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup.  Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future.  That question is
> yet unsettled from Power side to have an acknowledgement.  As of now for
> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd.  The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused.  It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.
> 

It is not true that hugepd is a software-only solution: PowerPC 8xx HW
matches the hugepd topology for 8M pages.

Christophe
Peter Xu March 25, 2024, 6:58 p.m. UTC | #4
On Fri, Mar 22, 2024 at 01:10:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:07:50PM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > v3:
> > - Rebased to latest mm-unstalbe (a824831a082f, of March 21th)
> > - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
> >   pXd_huge() users) with pXd_leaf() [Jason]
> > - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> > - Use IS_ENABLED() in follow_huge_pud() [Jason]
> > - Remove redundant none pud check in follow_pud_mask() [Jason]
> > 
> > rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
> > v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
> > v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
> > 
> > The series removes the hugetlb slow gup path after a previous refactor work
> > [1], so that slow gup now uses the exact same path to process all kinds of
> > memory including hugetlb.
> > 
> > For the long term, we may want to remove most, if not all, call sites of
> > huge_pte_offset().  It'll be ideal if that API can be completely dropped
> > from arch hugetlb API.  This series is one small step towards merging
> > hugetlb specific codes into generic mm paths.  From that POV, this series
> > removes one reference to huge_pte_offset() out of many others.
> 
> This remark would be a little easier to understand if you refer to
> hugetlb_walk() not huge_pte_offset() - the latter is only used to
> implement hugetlb_walk() and isn't removed by this series, while a
> single hugetlb_walk() was removed.

Right.  Here huge_pte_offset() is the arch API that I hope we can remove;
hugetlb_walk() is simply the wrapper.

> 
> Regardless, I think the point is spot on, the direction should be to
> make the page table reading generic with minimal/no interaction with
> hugetlbfs in the generic code.

Yes, and I also like your term for them: "pgtable readers".  It's a better
way to describe the difference in that regard between huge_pte_offset()
vs. huge_pte_alloc().  That's exactly my goal: we should start with the
"readers".

The writers might change semantics when merged, and can be more
challenging; I'll need to look into the details of each one, like the page
fault handlers.  Such work may need to be analyzed case by case, and this
GUP part is definitely the low-hanging fruit compared to the rest.

> 
> After this series I would suggest doing the pagewalk.c stuff as it is
> very parallel to GUP slow (indeed it would be amazing to figure out a
> way to make GUP slow and pagewalk.c use the same code without a
> performance cost)

Yes.  I hope there shouldn't be much of a perf degradation; I can do some
more measurements too when getting there.  And btw, IIUC the major
challenge of pagewalk.c is not the removal of walk_hugetlb_range() alone -
that may not be that hard if that were the sole purpose.  The better way
to go is to remove mm_walk_ops.hugetlb_entry() altogether, which will
cause a change in all callers; that's "the challenge".. pretty much labor
work, not a major technical challenge it seems.  Not sure if that's good
news or bad..
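
For reference, the hook in question is the one below (excerpt from
include/linux/pagewalk.h); note how it hands the entry to the callback as
a pte_t regardless of the real pgtable level:

    /* include/linux/pagewalk.h (excerpt) */
    struct mm_walk_ops {
            /* ... pud/pmd/pte hooks ... */
            int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
                                 unsigned long addr, unsigned long next,
                                 struct mm_walk *walk);
            /* ... */
    };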

One thing I'll soon look into is allowing huge mappings for PFNMAP;
there's a request from the VFIO side for MMIO.  Dropping
mm_walk_ops.hugetlb_entry() seems to be good for all such purposes; well,
I may need to bite the bullet here.. for either purpose to move on.

> 
> Some of the other core mm callers don't look too bad either, getting
> to the point where hugetlb_walk() is internal to hugetlb.c would be a
> nice step that looks reasonable size.

Agree.

> 
> > One goal of such a route is that we can reconsider merging hugetlb features
> > like High Granularity Mapping (HGM).  It was not accepted in the past
> > because it may add lots of hugetlb specific codes and make the mm code even
> > harder to maintain.  With a merged codeset, features like HGM can hopefully
> > share some code with THP, legacy (PMD+) or modern (continuous PTEs).
> 
> Yeah, if all the special hugetlb stuff is using generic arch code and
> generic page walkers (maybe with that special vma locking hook) it is
> much easier to swallow than adding yet another special class of code
> to all the page walkers.
> 
> > To make it work, the generic slow gup code will need to at least understand
> > hugepd, which is already done like so in fast-gup.  Due to the specialty of
> > hugepd to be software-only solution (no hardware recognizes the hugepd
> > format, so it's purely artificial structures), there's chance we can merge
> > some or all hugepd formats with cont_pte in the future.  That question is
> > yet unsettled from Power side to have an acknowledgement.  
> 
> At a minimum, I think we have a concurrence that the reading of the
> hugepd entries should be fine through the standard contig pte APIs and
> so all the page walkers doing read side operations could stop having
> special hugepd code. It is just an implementation of contig pte with
> the new property that the size of a PTE can be larger than a PMD
> entry.
> 
> If, or how much, the hugepd write side remains special is the open
> question, I think.

It seems the ball is rolling in that respect; I haven't looked in depth at
Christophe's series yet, but it's great that it has started!

> 
> > this series, I kept the hugepd handling because we may still need to do so
> > before getting a clearer picture of the future of hugepd.  The other reason
> > is simply that we did it already for fast-gup and most codes are still
> > around to be reused.  It'll make more sense to keep slow/fast gup behave
> > the same before a decision is made to remove hugepd.
> 
> Yeah, I think that is right for this series. Lets get this done and
> then try to remove hugepd read side.

Thanks a bunch for the reviews.
Jason Gunthorpe March 26, 2024, 2:02 p.m. UTC | #5
On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:

> > This remark would be a little easier to understand if you refer to
> > hugetlb_walk() not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() was removed.
> 
> Right.  Here huge_pte_offset() is the arch API that I hope we can remove,
> the hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works and that weirdness is part of what
motivates the arch API to supply special functions for reading. Not
every arch can just cast every level to a pte_t and still work.

But that weirdness also saves a lot of code, so something else needs to be
thought up..
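
The trick is easiest to see in the generic huge_pte_offset()
(mm/hugetlb.c, condensed) - upper-level entries are simply returned as
pte_t pointers:

    /* mm/hugetlb.c, generic huge_pte_offset(), condensed */
    pte_t *huge_pte_offset(struct mm_struct *mm,
                           unsigned long addr, unsigned long sz)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            p4d_t *p4d;
            pud_t *pud;

            if (!pgd_present(*pgd))
                    return NULL;
            p4d = p4d_offset(pgd, addr);
            if (!p4d_present(*p4d))
                    return NULL;
            pud = pud_offset(p4d, addr);
            if (sz == PUD_SIZE)
                    return (pte_t *)pud;    /* a pud posing as a pte */
            if (!pud_present(*pud))
                    return NULL;
            return (pte_t *)pmd_offset(pud, addr);  /* ditto for pmd */
    }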

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
> 
> Yes, and I also like your terms on calling them "pgtable readers".  It's a
> better way to describe the difference in that regard between
> huge_pte_offset() v.s. huge_pte_alloc().  Exactly that's my goal, that we
> should start with the "readers".

Yeah, it makes a lot of sense to tackle the readers first - we are pretty
close now to having enough done to have generic readers.  I would imagine
converting everything outside mm/huge*.c to use the normal page table API
for reading.

Then consider what to do with the reading in mm/huge*.c
 
> The writters might change semantics when merge, and can be more
> challenging, I'll need to look into details of each one, like page fault
> handlers.  Such work may need to be analyzed case by case, and this GUP
> part is definitely the low hanging fruit comparing to the rest.

The write side is tricky; I think if the read side is sorted out then it
will be easier to reason about the write side.  Today the write side is
paired with the special read side, and it is extra hard to understand
whether there is something weird hidden in the arch.
 
> measurements too when getting there.  And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that's the solo purpose.  The better way to go is to remove
> mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
> callers; that's "the challenge".. pretty much labor works, not a major
> technical challenge it seems.  Not sure if it's a good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of
passing in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's
idea of having some kind of generic page table API that is not tightly
tied to level. Replacing the hugetlb trick of 'everything is a PTE'
with 5 special cases in every place seems just horrible.

   struct mm_walk_ops {
       int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
   }

And many cases really want something like:
   struct mm_walk_state state;

   if (!mm_walk_seek_leaf(state, mm, address))
          goto no_present
   if (mm_walk_is_write(state)) ..

And detailed walking:
   for_each_pt_leaf(state, mm, address) {
       if (mm_walk_is_write(state)) ..
   }

Replacing it with a mm_walk_state that retains the level, or otherwise
allows decoding any entry, composes a lot better.  Forced loop unrolling
can get back to the current codegen in a lot of places.

It also makes the power stuff a bit nicer as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonably efficient codegen;
many operations are going to end up switching on some state->level to
know how to decode the entry.
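
E.g. even a trivial predicate ends up switching on the level (hypothetical
code, only extending the sketch above - none of these names exist today):

    /* Hypothetical: every accessor decodes via state->level */
    struct mm_walk_state {
            unsigned int level;     /* 1 = pte, 2 = pmd, 3 = pud, ... */
            void *entryp;           /* points at the pXd entry */
    };

    static bool mm_walk_is_write(struct mm_walk_state *state)
    {
            switch (state->level) {
            case 1: return pte_write(*(pte_t *)state->entryp);
            case 2: return pmd_write(*(pmd_t *)state->entryp);
            case 3: return pud_write(*(pud_t *)state->entryp);
            default: return false;  /* no p4d/pgd write helpers today */
            }
    }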

> One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
> one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
> seems to be good for all such purposes; well, I may need to bite the bullet
> here.. for either of the purposes to move on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
> 
> It seems balls are rolling in that aspect, I haven't looked in depth there
> in Christophe's series but it's great to have started!

Yes!

Jason
Peter Xu April 4, 2024, 9:48 p.m. UTC | #6
On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> The more I look at this the more I think we need to get to Matthew's
> idea of having some kind of generic page table API that is not tightly
> tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> with 5 special cases in every place seems just horrible.
> 
>    struct mm_walk_ops {
>        int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
>    }
> 
> And many cases really want something like:
>    struct mm_walk_state state;
> 
>    if (!mm_walk_seek_leaf(state, mm, address))
>           goto no_present
>    if (mm_walk_is_write(state)) ..
> 
> And detailed walking:
>    for_each_pt_leaf(state, mm, address) {
>        if (mm_walk_is_write(state)) ..
>    }
> 
> Replacing it with a mm_walk_state that retains the level or otherwise
> to allow decoding any entry composes a lot better. Forced Loop
> unrolling can get back to the current code gen in alot of places.
> 
> It also makes the power stuff a bit nicer as the mm_walk_state could
> automatically retain back pointers to the higher levels in the state
> struct too...
> 
> The puzzle is how to do it and still get reasonable efficient codegen,
> many operations are going to end up switching on some state->level to
> know how to decode the entry.

These discussions are definitely constructive, thanks Jason.  Very helpful.

I thought about this last week but got interrupted.  It does make sense to
me; it looks pretty generic and it is flexible enough as a top-level
design.  At least that's what I thought.

However, now that I rethink it, and having looked more into the code when
I got the chance, it turns out this would be a major rewrite of mostly
every walker..  That doesn't mean it is a bad idea, but then I'll need to
compare it with the other approach, because there can be a huge difference
in when we can get that code ready, I think. :)

Consider that what we (or.. I) want to teach the pXd layers is two things
right now: (1) hugetlb mappings, and (2) MMIO (PFN) mappings.  Those
mostly share the same generic concept when working on the mm walkers no
matter which way we go, just with different treatment for different types
of memory.  (2) is new stuff on top of the current code, while (1) is a
refactoring with dropping the hugetlb_entry() hook point as the goal.

Taking the simplest mm walker (smaps) as an example, I think most of the
code is ready thanks to THP's existence, and helpers like
vm_normal_page[_pmd]() should even already work for pfnmaps; the pud layer
is missing, but that should be trivial.  It means we may have a chance to
drop hugetlb_entry() without a huge overhaul yet.
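
E.g. the pmd-level smaps handler already reads huge mappings through the
generic helpers (fs/proc/task_mmu.c, condensed):

    /* fs/proc/task_mmu.c, smaps_pmd_entry(), condensed */
    static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
                                struct mm_walk *walk)
    {
            struct vm_area_struct *vma = walk->vma;
            struct page *page = NULL;

            if (pmd_present(*pmd))
                    page = vm_normal_page_pmd(vma, addr, *pmd);
            if (!page)
                    return;
            /* ... account the whole HPAGE_PMD_SIZE range ... */
    }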

Now the important question I'm asking myself is: do we really need huge
p4d or even bigger?  It's 512GB on x86, and we have said "No 512 GiB pages
yet" (commit fe1e8c3e963) since 2017 - that is 7 years without changing
this fact, while on non-x86 p4d_leaf() is never defined.  Then it's also
interesting to see how much code is "ready" to handle p4d entries (by
looking at p4d_leaf() calls; much easier to see with the removal of the
rest of the huge APIs..) even though none exist.

So, might we over-engineer if we go the generic route now, considering
that we already have most of the pmd/pud entries around in the mm walker
ops?  So far it sounds better to leave it for later, until it is further
justified to be useful.  And nothing blocks it if it is ever justified as
needed; I'd say it can also be seen as a step forward if I can manage to
remove hugetlb_entry() first.

Comments welcome (before I start to work on anything..).

Thanks,
Jason Gunthorpe April 5, 2024, 6:16 p.m. UTC | #7
On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > The more I look at this the more I think we need to get to Matthew's
> > idea of having some kind of generic page table API that is not tightly
> > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > with 5 special cases in every place seems just horrible.
> > 
> >    struct mm_walk_ops {
> >        int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> >    }
> > 
> > And many cases really want something like:
> >    struct mm_walk_state state;
> > 
> >    if (!mm_walk_seek_leaf(state, mm, address))
> >           goto no_present
> >    if (mm_walk_is_write(state)) ..
> > 
> > And detailed walking:
> >    for_each_pt_leaf(state, mm, address) {
> >        if (mm_walk_is_write(state)) ..
> >    }
> > 
> > Replacing it with a mm_walk_state that retains the level or otherwise
> > to allow decoding any entry composes a lot better. Forced Loop
> > unrolling can get back to the current code gen in alot of places.
> > 
> > It also makes the power stuff a bit nicer as the mm_walk_state could
> > automatically retain back pointers to the higher levels in the state
> > struct too...
> > 
> > The puzzle is how to do it and still get reasonable efficient codegen,
> > many operations are going to end up switching on some state->level to
> > know how to decode the entry.
> 
> These discussions are definitely constructive, thanks Jason.  Very helpful.
> 
> I thought about this last week but got interrupted.  It does make sense to
> me; it looks pretty generic and it is flexible enough as a top design.  At
> least that's what I thought.

Yeah, exactly..

> However now when I rethink about it, and look more into the code when I got
> the chance, it turns out this will be a major rewrite of mostly every
> walkers..  

Indeed, it is why it may not be reasonable.

> Consider that what we (or.. I) want to teach the pXd layers are two things
> right now: (1) hugetlb mappings (2) MMIO (PFN) mappings.  That mostly
> shares the generic concept when working on the mm walkers no matter which
> way to go, just different treatment on different type of mem.  (2) is on
> top of current code and new stuff, while (1) is a refactoring to drop
> hugetlb_entry() hook point as the goal.

Right, I view this as a two-pronged attack

On one front you teach the generic pXX_* macros to process huge pages
and push that around to the performance walkers like GUP

On the other front you want to replace use of the hugepte with the new
walkers.

The challenge with the hugepte code is that it is all structured to
assume one API that works at all levels and that may be a hard fit to
replace with pXX_* functions.

The places that are easy to switch from hugetlb to pXX_* may as well
do so.

Other places maybe need a hugetlb replacement that has a similar
abstraction level of pointing to any page table level.

I think if you do the easy places for pXX conversion you will have a
good idea about what is needed for the hard places.

> Now the important question I'm asking myself is: do we really need huge p4d
> or even bigger?

Do we need huge p4d support with folios? Probably not..

Huge p4d support for pfnmap, e.g. in VFIO?  Yes, I think that is possibly
interesting - but I wouldn't ask anyone to do the work :)

But then again we come back to power and its big list of page sizes
and variety :( Looks like some there have huge sizes at the pgd level
at least.

> So, can we over-engineer too much if we go the generic route now?

Yes we can, and it will probably be slow as well. The pXX macros are
the most efficient if code can be adapted to use them.
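
i.e. adapted code reads as a plain level-by-level descent; a sketch, where
handle_leaf()/handle_ptes() are placeholders and the none/bad checks are
trimmed:

    /* Sketch of a pXX_*-based read-side walk (placeholders only) */
    static int walk_one_addr(struct mm_struct *mm, unsigned long addr)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            p4d_t *p4d = p4d_offset(pgd, addr);
            pud_t *pud = pud_offset(p4d, addr);
            pmd_t *pmd;

            if (pud_leaf(*pud))                     /* e.g. a 1G mapping */
                    return handle_leaf(pud_pfn(*pud), PUD_SIZE);
            pmd = pmd_offset(pud, addr);
            if (pmd_leaf(*pmd))                     /* e.g. a 2M mapping */
                    return handle_leaf(pmd_pfn(*pmd), PMD_SIZE);
            return handle_ptes(pmd, addr);          /* 4K or cont-pte */
    }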

> Considering that we already have most of pmd/pud entries around in the mm
> walker ops.

Yeah, so you add pgd and maybe p4d, and then we don't need any generic
thing.  If it is easy, it would be nice.

Jason
Peter Xu April 5, 2024, 9:42 p.m. UTC | #8
On Fri, Apr 05, 2024 at 03:16:33PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> > On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > > The more I look at this the more I think we need to get to Matthew's
> > > idea of having some kind of generic page table API that is not tightly
> > > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > > with 5 special cases in every place seems just horrible.
> > > 
> > >    struct mm_walk_ops {
> > >        int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> > >    }
> > > 
> > > And many cases really want something like:
> > >    struct mm_walk_state state;
> > > 
> > >    if (!mm_walk_seek_leaf(state, mm, address))
> > >           goto no_present
> > >    if (mm_walk_is_write(state)) ..
> > > 
> > > And detailed walking:
> > >    for_each_pt_leaf(state, mm, address) {
> > >        if (mm_walk_is_write(state)) ..
> > >    }
> > > 
> > > Replacing it with a mm_walk_state that retains the level or otherwise
> > > to allow decoding any entry composes a lot better. Forced Loop
> > > unrolling can get back to the current code gen in alot of places.
> > > 
> > > It also makes the power stuff a bit nicer as the mm_walk_state could
> > > automatically retain back pointers to the higher levels in the state
> > > struct too...
> > > 
> > > The puzzle is how to do it and still get reasonable efficient codegen,
> > > many operations are going to end up switching on some state->level to
> > > know how to decode the entry.
> > 
> > These discussions are definitely constructive, thanks Jason.  Very helpful.
> > 
> > I thought about this last week but got interrupted.  It does make sense to
> > me; it looks pretty generic and it is flexible enough as a top design.  At
> > least that's what I thought.
> 
> Yeah, exactly..
> 
> > However now when I rethink about it, and look more into the code when I got
> > the chance, it turns out this will be a major rewrite of mostly every
> > walkers..  
> 
> Indeed, it is why it may not be reasonable.
> 
> > Consider that what we (or.. I) want to teach the pXd layers are two things
> > right now: (1) hugetlb mappings (2) MMIO (PFN) mappings.  That mostly
> > shares the generic concept when working on the mm walkers no matter which
> > way to go, just different treatment on different type of mem.  (2) is on
> > top of current code and new stuff, while (1) is a refactoring to drop
> > hugetlb_entry() hook point as the goal.
> 
> Right, I view this as a two pronged attack
> 
> One one front you teach the generic pXX_* macros to process huge pages
> and push that around to the performance walkers like GUP
> 
> On another front you want to replace use of the hugepte with the new
> walkers.
> 
> The challenge with the hugepte code is that it is all structured to
> assume one API that works at all levels and that may be a hard fit to
> replace with pXX_* functions.

That's the core of the problem; otherwise I feel like I might already be
doing something else.  I have a feeling that even if it's currently
efficient for hugetlb, we'll drop that sooner or later.

The issue is that at least the hugetlb pgtable format is exactly the same
as the rest; as large folio support grows, it will reach the point where
we complain more than before about hugetlb doing its smart things on its
own.

> 
> The places that are easy to switch from hugetlb to pXX_* may as well
> do so.
> 
> Other places maybe need a hugetlb replacement that has a similar
> abstraction level of pointing to any page table level.

IMHO this depends.

Per my current knowledge, hugetlb is only special in three forms:

- huge mapping (well, this isn't that special..)
- cont pte/pmd/...
- hugepd

The fanciest one is actually hugepd.. but I plan to put that aside
temporarily - I haven't looked at Christophe's series yet; however, I
think we're solving orthogonal issues, and his work will definitely help
me reach mine.  I want to think things through on my end first to have a
plan.  If hugepd has a chance to be generalized, the worst case is I'll
leverage CONFIG_ARCH_HAS_HUGEPD and only keep hugetlb_entry() for those
arches until hugepd becomes some cont-pXX variant.  Then, with HAS_HUGEPD
put aside, I don't think it's very complicated, perhaps?

In short, hugetlb mappings shouldn't be special compared to other huge pXd
and large folio (cont-pXd) mappings for most of the walkers in my mind, if
not all.  I need to look at all the walkers, and there can be some tricky
ones, but I believe that applies in general.  It's actually similar to
what I did with slow gup here.

Like this series, for cont-pXd we'll need multiple walks compared to
before (with hugetlb_entry()), but for that part I'll provide some
performance tests too, and we also have a fallback plan, which is to
detect cont-pXd existence; that will also work for large folios.

> 
> I think if you do the easy places for pXX conversion you will have a
> good idea about what is needed for the hard places.

Here IMHO we don't need to understand "what is the size of this hugetlb
vma" or "at which pgtable level are this hugetlb vma's pages located",
because we may not need that, e.g., when we only want to collect some
smaps statistics.  "Whether it's hugetlb" may matter, though.  E.g., when
the mm walker sees a huge pmd, it can be a THP or a hugetlb page (once
hugetlb_entry is removed); we may need an extra check later to put things
into the right bucket, but the walker itself doesn't necessarily need
hugetlb_entry().

> 
> > Now the important question I'm asking myself is: do we really need huge p4d
> > or even bigger?
> 
> Do we need huge p4d support with folios? Probably not..
> 
> huge p4d support for pfnmap, eg in VFIO. Yes I think that is possibly
> interesting - but I wouldn't ask anyone to do the work :)

Considering it does not yet exist, and we have no plan to work it out (so
far), maybe I can take this as a very implicit ack that I can put it
aside, at least for now. :)

> 
> But then again we come back to power and its big list of page sizes
> and variety :( Looks like some there have huge sizes at the pgd level
> at least.

Yeah, this is something I want to be super clear about, because I may be
missing something: we don't have real pgd pages, right?  PowerPC doesn't
even define p4d_leaf(), AFAICT.

$ git grep p4d_leaf
arch/powerpc/mm/book3s64/radix_pgtable.c:               if (p4d_leaf(*p4d)) {
arch/powerpc/mm/pgtable.c:      if (p4d_leaf(p4d)) {
arch/powerpc/mm/pgtable_64.c:   if (p4d_leaf(p4d)) {
arch/powerpc/mm/pgtable_64.c:                   VM_WARN_ON(!p4d_leaf(p4d));
arch/powerpc/xmon/xmon.c:       if (p4d_leaf(*p4dp)) {

They all constantly return false (which is the fallback)?
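
For reference, the generic fallbacks in include/linux/pgtable.h are
compile-time false when an arch doesn't define them, roughly:

    /* include/linux/pgtable.h: generic fallbacks, roughly */
    #ifndef p4d_leaf
    #define p4d_leaf(x)     false
    #endif
    #ifndef pgd_leaf
    #define pgd_leaf(x)     false
    #endif
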

> 
> > So, can we over-engineer too much if we go the generic route now?
> 
> Yes we can, and it will probably be slow as well. The pXX macros are
> the most efficient if code can be adapted to use them.
> 
> > Considering that we already have most of pmd/pud entries around in the mm
> > walker ops.
> 
> Yeah, so you add pgd and maybe p4d and then we can don't need any
> generic thing. If it is easy it would be nice.

I want to double-check the above on PowerPC first; otherwise I'd consider
leaving p4d+ alone, if that looks OK.  It could be that I missed something
important; Christophe may want to chime in if he has any thoughts, or just
to correct my mistakes.

Thanks,
Jason Gunthorpe April 9, 2024, 11:43 p.m. UTC | #9
On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> not all.  I need to look at all the walkers and there can be some tricky
> ones, but I believe that applies in general.  It's actually similar to what
> I did with slow gup here.

I think that is the big question; I also haven't done the research to
know the answer.

At this point focusing on moving what is reasonable to the pXX_* API
makes sense to me. Then reviewing what remains and making some
decision.

> Like this series, for cont-pXd we'll need multiple walks comparing to
> before (when with hugetlb_entry()), but for that part I'll provide some
> performance tests too, and we also have a fallback plan, which is to detect
> cont-pXd existance, which will also work for large folios.

I think we can optimize this pretty easily.
 
> > I think if you do the easy places for pXX conversion you will have a
> > good idea about what is needed for the hard places.
> 
> Here IMHO we don't need to understand "what is the size of this hugetlb
> vma"

Yeh, I never really understood why hugetlb was linked to the VMA.. The
page table is self-describing, obviously.

> or "which level of pgtable does this hugetlb vma pages locate",

Ditto

> because we may not need that, e.g., when we only want to collect some smaps
> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> hugetlb_entry removed), we may need extra check later to put things into
> the right bucket, but for the walker itself it doesn't necessarily need
> hugetlb_entry().

Right, places may still need to know it is part of a huge VMA because we
have special stuff linked to that.

> > But then again we come back to power and its big list of page sizes
> > and variety :( Looks like some there have huge sizes at the pgd level
> > at least.
> 
> Yeah this is something I want to be super clear, because I may miss
> something: we don't have real pgd pages, right?  Powerpc doesn't even
> define p4d_leaf(), AFAICT.

AFAICT it is because it hides it all in hugepd.

If the goal is to purge hugepd then some of the options might turn out
to convert hugepd into huge p4d/pgd, as I understand it. It would be
nice to have certainty on this at least.

We have effectively three APIs to parse a single page table and
currently none of the APIs can return 100% of the data for power.

Jason
Peter Xu April 10, 2024, 3:28 p.m. UTC | #10
On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> > In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> > and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> > not all.  I need to look at all the walkers and there can be some tricky
> > ones, but I believe that applies in general.  It's actually similar to what
> > I did with slow gup here.
> 
> I think that is the big question, I also haven't done the research to
> know the answer.
> 
> At this point focusing on moving what is reasonable to the pXX_* API
> makes sense to me. Then reviewing what remains and making some
> decision.
> 
> > Like this series, for cont-pXd we'll need multiple walks comparing to
> > before (when with hugetlb_entry()), but for that part I'll provide some
> > performance tests too, and we also have a fallback plan, which is to detect
> > cont-pXd existance, which will also work for large folios.
> 
> I think we can optimize this pretty easy.
>  
> > > I think if you do the easy places for pXX conversion you will have a
> > > good idea about what is needed for the hard places.
> > 
> > Here IMHO we don't need to understand "what is the size of this hugetlb
> > vma"
> 
> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> page table is self describing, obviously.

Attaching to the vma still makes sense to me, since we should definitely
avoid a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb
pages are allocated, managed, ...  totally differently.

And since hugetlb is designed as file-based (which also makes sense to me,
at least for now), it's also natural that it's vma-attached.

> 
> > or "which level of pgtable does this hugetlb vma pages locate",
> 
> Ditto
> 
> > because we may not need that, e.g., when we only want to collect some smaps
> > statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> > walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> > hugetlb_entry removed), we may need extra check later to put things into
> > the right bucket, but for the walker itself it doesn't necessarily need
> > hugetlb_entry().
> 
> Right, places may still need to know it is part of a huge VMA because we
> have special stuff linked to that.
> 
> > > But then again we come back to power and its big list of page sizes
> > > and variety :( Looks like some there have huge sizes at the pgd level
> > > at least.
> > 
> > Yeah this is something I want to be super clear, because I may miss
> > something: we don't have real pgd pages, right?  Powerpc doesn't even
> > define p4d_leaf(), AFAICT.
> 
> AFAICT it is because it hides it all in hugepd.

IMHO one thing we can benefit from in such a hugepd rework is: if we can
squash all the hugepds like Christophe does, then we push them one more
layer down, and there is a good chance everything should just work.

So again my Power brain is close to zero, but now I'm referring to what
Christophe shared in the other thread:

https://github.com/linuxppc/wiki/wiki/Huge-pages

Together with:

https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu

Where it has:

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
        select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
        select ARCH_ENABLE_SPLIT_PMD_PTLOCK
        select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
        select ARCH_SUPPORTS_HUGETLBFS
        select ARCH_SUPPORTS_NUMA_BALANCING
        select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
        select FSL_EMB_PERFMON
        bool
+       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
        select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
        select PPC_SMP_MUXED_IPI
        select PPC_DOORBELL

So I think that means we have three PowerPC systems supporting hugepd
right now; besides the 8xx (for which Christophe is trying to drop the
support), we still have book3s_64 and e500.

Let's check one by one:

  - book3s_64

    - hash

      - 64K: p4d is not used, largest pgsize pgd 16G @pud level.  It
        means after squashing it'll be a bunch of cont-pmd, all good.

      - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
        it'll be cont-pud. all good.

    - radix

      - 64K: largest 1G @pud, then cont-pmd after squashed. all good.

      - 4K: largest 1G @pud, then cont-pmd, all good.

  - e500 & 8xx

    - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
      @pgd level they become cont-pte. all good.

I think the trick here is that there will be no pgd leaves left after
squashing hugepd into the lower levels; and since PowerPC never seems to
have p4d, everything then falls into pud or lower.  We seem to be all good
there?

> 
> If the goal is to purge hugepd then some of the options might turn out
> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> nice to have certainty on this at least.

Right.  I hope the pmd/pud plan I proposed above can already work for such
an ambitious goal too.  But review is very welcome from either you or
Christophe.

PS: I think I'll also have a closer look at Christophe's series this week
or next.

> 
> We have effectively three APIs to parse a single page table and
> currently none of the APIs can return 100% of the data for power.

Thanks,
Christophe Leroy April 10, 2024, 4:30 p.m. UTC | #11
On 10/04/2024 at 17:28, Peter Xu wrote:
> On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
>> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
>>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
>>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
>>> not all.  I need to look at all the walkers and there can be some tricky
>>> ones, but I believe that applies in general.  It's actually similar to what
>>> I did with slow gup here.
>>
>> I think that is the big question, I also haven't done the research to
>> know the answer.
>>
>> At this point focusing on moving what is reasonable to the pXX_* API
>> makes sense to me. Then reviewing what remains and making some
>> decision.
>>
>>> Like this series, for cont-pXd we'll need multiple walks comparing to
>>> before (when with hugetlb_entry()), but for that part I'll provide some
>>> performance tests too, and we also have a fallback plan, which is to detect
>>> cont-pXd existance, which will also work for large folios.
>>
>> I think we can optimize this pretty easy.
>>   
>>>> I think if you do the easy places for pXX conversion you will have a
>>>> good idea about what is needed for the hard places.
>>>
>>> Here IMHO we don't need to understand "what is the size of this hugetlb
>>> vma"
>>
>> Yeh, I never really understood why hugetlb was linked to the VMA.. The
>> page table is self describing, obviously.
> 
> Attaching to vma still makes sense to me, where we should definitely avoid
> a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> allocated, managed, ...  totally differently.
> 
> And since hugetlb is designed as file-based (which also makes sense to me,
> at least for now), it's also natural that it's vma-attached.
> 
>>
>>> or "which level of pgtable does this hugetlb vma pages locate",
>>
>> Ditto
>>
>>> because we may not need that, e.g., when we only want to collect some smaps
>>> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
>>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
>>> hugetlb_entry removed), we may need extra check later to put things into
>>> the right bucket, but for the walker itself it doesn't necessarily need
>>> hugetlb_entry().
>>
>> Right, places may still need to know it is part of a huge VMA because we
>> have special stuff linked to that.
>>
>>>> But then again we come back to power and its big list of page sizes
>>>> and variety :( Looks like some there have huge sizes at the pgd level
>>>> at least.
>>>
>>> Yeah this is something I want to be super clear, because I may miss
>>> something: we don't have real pgd pages, right?  Powerpc doesn't even
>>> define p4d_leaf(), AFAICT.
>>
>> AFAICT it is because it hides it all in hugepd.
> 
> IMHO one thing we can benefit from such hugepd rework is, if we can squash
> all the hugepds like what Christophe does, then we push it one more layer
> down, and we have a good chance all things should just work.
> 
> So again my Power brain is close to zero, but now I'm referring to what
> Christophe shared in the other thread:
> 
> https://github.com/linuxppc/wiki/wiki/Huge-pages
> 
> Together with:
> 
> https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
> 
> Where it has:
> 
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
>          select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>          select ARCH_ENABLE_SPLIT_PMD_PTLOCK
>          select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>          select ARCH_SUPPORTS_HUGETLBFS
>          select ARCH_SUPPORTS_NUMA_BALANCING
>          select HAVE_MOVE_PMD
> @@ -290,6 +291,7 @@ config PPC_BOOK3S
>   config PPC_E500
>          select FSL_EMB_PERFMON
>          bool
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>          select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
>          select PPC_SMP_MUXED_IPI
>          select PPC_DOORBELL
> 
> So I think it means we have three PowerPC systems that supports hugepd
> right now (besides the 8xx which Christophe is trying to drop support
> there), besides 8xx we still have book3s_64 and E500.
> 
> Let's check one by one:
> 
>    - book3s_64
> 
>      - hash
> 
>        - 64K: p4d is not used, largest pgsize pgd 16G @pud level.  It
>          means after squashing it'll be a bunch of cont-pmd, all good.
> 
>        - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
>          it'll be cont-pud. all good.
> 
>      - radix
> 
>        - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> 
>        - 4K: largest 1G @pud, then cont-pmd, all good.
> 
>    - e500 & 8xx
> 
>      - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
>        @pgd level they become cont-pte. all good.

e500 has two modes: 32 bits and 64 bits.

For 32 bits:

8xx is the only one handling it through HW-assisted pagetable walk, hence
requiring 2 levels whatever the page size is.

On e500 it is all software, so pages 2M and larger should be cont-PGD (by
the way, I'm a bit puzzled that on arches that have only 2 levels, i.e.
PGD and PTE, the PGD entries are populated by a function called
pmd_populate()).
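
(The pmd_populate() puzzle comes from level folding: on a 2-level arch the
folded pud/pmd are just the top-level pgd seen through different types,
roughly as in include/asm-generic/pgtable-nopmd.h, so writing a pmd entry
really writes a PGD entry there:)

    /* include/asm-generic/pgtable-nopmd.h (excerpt, roughly): with
     * the pmd level folded away, a "pmd" is the pud (and, with the
     * pud/p4d also folded, ultimately the pgd), so writing a pmd
     * entry writes the top-level entry on a 2-level arch. */
    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud;
    }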

The current situation for 8xx is illustrated here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx

I also tried to better illustrate e500/32 here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#e500

For 64 bits:
We have PTE/PMD/PUD/PGD, no P4D

See arch/powerpc/include/asm/nohash/64/pgtable-4k.h


> 
> I think the trick here is there'll be no pgd leaves after hugepd squashing
> to lower levels, then since PowerPC seems to never have p4d, then all
> things fall into pud or lower.  We seem to be all good there?
> 
>>
>> If the goal is to purge hugepd then some of the options might turn out
>> to convert hugepd into huge p4d/pgd, as I understand it. It would be
>> nice to have certainty on this at least.
> 
> Right.  I hope the pmd/pud plan I proposed above can already work too with
> such ambicious goal too.  But review very welcomed from either you or
> Christophe.
> 
> PS: I think I'll also have a closer look at Christophe's series this week
> or next.
> 
>>
>> We have effectively three APIs to parse a single page table and
>> currently none of the APIs can return 100% of the data for power.
> 
> Thanks,
>
Peter Xu April 10, 2024, 7:58 p.m. UTC | #12
On Wed, Apr 10, 2024 at 04:30:41PM +0000, Christophe Leroy wrote:
> 
> 
> Le 10/04/2024 à 17:28, Peter Xu a écrit :
> > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> >>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> >>> not all.  I need to look at all the walkers and there can be some tricky
> >>> ones, but I believe that applies in general.  It's actually similar to what
> >>> I did with slow gup here.
> >>
> >> I think that is the big question, I also haven't done the research to
> >> know the answer.
> >>
> >> At this point focusing on moving what is reasonable to the pXX_* API
> >> makes sense to me. Then reviewing what remains and making some
> >> decision.
> >>
> >>> Like this series, for cont-pXd we'll need multiple walks comparing to
> >>> before (when with hugetlb_entry()), but for that part I'll provide some
> >>> performance tests too, and we also have a fallback plan, which is to detect
> >>> cont-pXd existance, which will also work for large folios.
> >>
> >> I think we can optimize this pretty easy.
> >>   
> >>>> I think if you do the easy places for pXX conversion you will have a
> >>>> good idea about what is needed for the hard places.
> >>>
> >>> Here IMHO we don't need to understand "what is the size of this hugetlb
> >>> vma"
> >>
> >> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> >> page table is self describing, obviously.
> > 
> > Attaching to vma still makes sense to me, where we should definitely avoid
> > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> > allocated, managed, ...  totally differently.
> > 
> > And since hugetlb is designed as file-based (which also makes sense to me,
> > at least for now), it's also natural that it's vma-attached.
> > 
> >>
> >>> or "which level of pgtable does this hugetlb vma pages locate",
> >>
> >> Ditto
> >>
> >>> because we may not need that, e.g., when we only want to collect some smaps
> >>> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> >>> hugetlb_entry removed), we may need extra check later to put things into
> >>> the right bucket, but for the walker itself it doesn't necessarily need
> >>> hugetlb_entry().
> >>
> >> Right, places may still need to know it is part of a huge VMA because we
> >> have special stuff linked to that.
> >>
> >>>> But then again we come back to power and its big list of page sizes
> >>>> and variety :( Looks like some there have huge sizes at the pgd level
> >>>> at least.
> >>>
> >>> Yeah, this is something I want to be super clear about, because I may
> >>> be missing something: we don't have real pgd pages, right?  PowerPC
> >>> doesn't even define p4d_leaf(), AFAICT.
> >>
> >> AFAICT it is because it hides it all in hugepd.
> > 
> > IMHO one benefit we can get from such a hugepd rework is: if we can
> > squash all the hugepds like Christophe does, then push them one more
> > layer down, we have a good chance that everything will just work.
> > 
> > So again my Power brain is close to zero, but now I'm referring to what
> > Christophe shared in the other thread:
> > 
> > https://github.com/linuxppc/wiki/wiki/Huge-pages
> > 
> > Together with:
> > 
> > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
> > 
> > Where it has:
> > 
> > --- a/arch/powerpc/platforms/Kconfig.cputype
> > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
> >          select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> >          select ARCH_ENABLE_SPLIT_PMD_PTLOCK
> >          select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> > +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> >          select ARCH_SUPPORTS_HUGETLBFS
> >          select ARCH_SUPPORTS_NUMA_BALANCING
> >          select HAVE_MOVE_PMD
> > @@ -290,6 +291,7 @@ config PPC_BOOK3S
> >   config PPC_E500
> >          select FSL_EMB_PERFMON
> >          bool
> > +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> >          select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
> >          select PPC_SMP_MUXED_IPI
> >          select PPC_DOORBELL
> > 
> > So I think it means we have three PowerPC systems that support hugepd
> > right now: besides the 8xx (for which Christophe is trying to drop that
> > support), we still have book3s_64 and e500.
> > 
> > Let's check one by one:
> > 
> >    - book3s_64
> > 
> >      - hash
> > 
> >        - 64K: p4d is not used, largest pgsize is 16G @pud level.  It
> >          means after squashing it'll be a bunch of cont-pmd, all good.
> > 
> >        - 4K: p4d also not used, largest pgsize is 128G; after squashing
> >          it'll be cont-pud. all good.
> > 
> >      - radix
> > 
> >        - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> > 
> >        - 4K: largest 1G @pud, then cont-pmd, all good.
> > 
> >    - e500 & 8xx
> > 
> >      - both of them use 2-level pgtables (pgd + pte); after squashing the
> >        hugepd @pgd level they become cont-pte. all good.
> 
> e500 has two modes: 32 bits and 64 bits.
> 
> For 32 bits:
> 
> 8xx is the only one handling it through a HW-assisted pagetable walk,
> hence requiring 2 levels whatever the pagesize is.

Hmm, I think maybe I finally get it..

I think the confusion came from the fact that there's always such a
level-2 table described in Figure 8-5 of the manual:

https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

So I suppose you meant that for 8M, the PowerPC 8xx hardware will be aware
of such an 8M pgtable (from the level-1 entry, where it has bits 28-29 set
to 011b), so it won't ever read anything starting from "Level-2 Descriptor
1" (it only reads the single entry "Level-2 Descriptor 0"), and
fundamentally the hugepd format must look like that for 8xx?

But then perhaps it's still compatible with cont-pte, because the
remaining entries (pte index 1+) will simply be ignored by the hardware?

> 
> On e500 it is all software, so pages 2M and larger should be cont-PGD (by
> the way, I'm a bit puzzled that on arches that have only 2 levels, i.e.
> PGD and PTE, the PGD entries are populated by a function called
> pmd_populate()).

Yeah.. I am also wondering whether pgd_populate() could work there too
(perhaps with some trivial changes, or maybe none needed..), as when the
p4d/pud/pmd levels are missing, Linux should just do something like an
enforced cast from pgd_t* -> pmd_t* in this case.

I think currently they're already not treated as pgd, as
__find_linux_pte() already skips the pgd level unconditionally:

	pgdp = pgdir + pgd_index(ea);
	p4dp = p4d_offset(pgdp, ea);
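
For reference on that folding, this is a simplified copy of what
include/asm-generic/pgtable-nop4d.h does today: with p4d folded,
"stepping down" a level is just a cast, so the same trick could in
principle carry a 2-level arch all the way down to pmd_t*:

	/* simplified from include/asm-generic/pgtable-nop4d.h: the folded
	 * p4d_offset() never dereferences a real p4d table, it is just an
	 * identity cast on the pgd entry */
	static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
	{
		return (p4d_t *)pgd;
	}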

> 
> Current situation for 8xx is illustrated here: 
> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
> 
> I also tried to better illustrate e500/32 here: 
> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
> 
> For 64 bits:
> We have PTE/PMD/PUD/PGD, no P4D
> 
> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

We don't have anything above pud in this category, right?  That's what I
read from your wiki (and thanks for providing that in the first place; it
helps a lot for me to understand how it works on PowerPC).

I want to make sure I can move on without caring about p4d/pgd leaves,
like what we do right now, even after we remove hugepd for good.  In that
case, since p4d is always missing, it's about whether "pud|pmd|pte_leaf()"
can also cover the pgd ones when that day comes, iiuc.
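
To make the question concrete, below is a minimal sketch of the lookup I
have in mind (none/bad checks omitted; it assumes leaves can only show up
at pud/pmd/pte, which is exactly the open question above):

	/* minimal sketch, assuming pgd/p4d are always pure tables (p4d
	 * possibly folded) and leaf entries only exist at pud/pmd/pte */
	pgd = pgd_offset(mm, addr);
	p4d = p4d_offset(pgd, addr);
	pud = pud_offset(p4d, addr);
	if (pud_leaf(*pud))
		return (pte_t *)pud;
	pmd = pmd_offset(pud, addr);
	if (pmd_leaf(*pmd))
		return (pte_t *)pmd;
	return pte_offset_kernel(pmd, addr);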

Thanks,

> 
> 
> > 
> > I think the trick here is that there'll be no pgd leaves after squashing
> > hugepd to lower levels; then, since PowerPC seems to never have p4d, all
> > things fall into pud or lower.  We seem to be all good there?
> > 
> >>
> >> If the goal is to purge hugepd then some of the options might turn out
> >> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> >> nice to have certainty on this at least.
> > 
> > Right.  I hope the pmd/pud plan I proposed above can already work with
> > such an ambitious goal too.  But review is very welcome from either you
> > or Christophe.
> > 
> > PS: I think I'll also have a closer look at Christophe's series this week
> > or next.
> > 
> >>
> >> We have effectively three APIs to parse a single page table and
> >> currently none of the APIs can return 100% of the data for power.
> > 
> > Thanks,
> >
Christophe Leroy April 12, 2024, 2:27 p.m. UTC | #13
Le 10/04/2024 à 21:58, Peter Xu a écrit :
>>
>> e500 has two modes: 32 bits and 64 bits.
>>
>> For 32 bits:
>>
>> 8xx is the only one handling it through a HW-assisted pagetable walk,
>> hence requiring 2 levels whatever the pagesize is.
> 
> Hmm, I think maybe I finally get it..
> 
> I think the confusion came from the fact that there's always such a
> level-2 table described in Figure 8-5 of the manual:
> 
> https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

Yes indeed that figure is confusing.

Table 8-1 gives a pretty good idea of what is required.  We only use
MD_CTR[TWAM] = 1.

> 
> So I suppose you meant that for 8M, the PowerPC 8xx hardware will be aware
> of such an 8M pgtable (from the level-1 entry, where it has bits 28-29 set
> to 011b), so it won't ever read anything starting from "Level-2 Descriptor
> 1" (it only reads the single entry "Level-2 Descriptor 0"), and
> fundamentally the hugepd format must look like that for 8xx?
> 
> But then perhaps it's still compatible with cont-pte, because the
> remaining entries (pte index 1+) will simply be ignored by the hardware?

Yes, still compatible with cont-PTE, although things become tricky
because you need two page tables to get the full 8M, so that's a kind of
cont-PMD down to the PTE level, as you can see in my RFC series.
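
Very roughly, the idea looks like this (simplified pseudo-code only, not
the actual RFC code; helper names and loop bounds are approximate, and the
8M PTE encoding details are left out):

	/* rough sketch: one 8M page spans two consecutive pmd slots, each
	 * pointing to a PTE page whose entries all describe (pieces of)
	 * the same 8M mapping */
	for (i = 0; i < 2; i++) {
		pmd_populate_kernel(mm, pmdp + i, ptep[i]);
		for (j = 0; j < PTRS_PER_PTE; j++, addr += PAGE_SIZE)
			set_pte_at(mm, addr, ptep[i] + j, pte);
	}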

> 
>>
>> On e500 it is all software, so pages 2M and larger should be cont-PGD (by
>> the way, I'm a bit puzzled that on arches that have only 2 levels, i.e.
>> PGD and PTE, the PGD entries are populated by a function called
>> pmd_populate()).
> 
> Yeah.. I am also wondering whether pgd_populate() could work there too
> (perhaps with some trivial changes, or maybe none needed..), as when the
> p4d/pud/pmd levels are missing, Linux should just do something like an
> enforced cast from pgd_t* -> pmd_t* in this case.
> 
> I think currently they're already not treated as pgd, as
> __find_linux_pte() already skips the pgd level unconditionally:
> 
> 	pgdp = pgdir + pgd_index(ea);
> 	p4dp = p4d_offset(pgdp, ea);
> 

Yes, that's what is confusing: some parts of the code consider we have
only a PGD and a PT, while other parts consider we have only a PMD and
a PT.
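
An illustrative fragment of what I mean (not taken from the actual ppc
code; it just shows the generic folded walk that leads to this):

	/* illustrative only: on a 2-level arch the "pmd" reached through
	 * the folded (no-op) p4d/pud/pmd steps is really the pgd entry */
	pgd_t *pgdp = pgd_offset(mm, addr);
	p4d_t *p4dp = p4d_offset(pgdp, addr);	/* folded: identity cast */
	pud_t *pudp = pud_offset(p4dp, addr);	/* folded: identity cast */
	pmd_t *pmdp = pmd_offset(pudp, addr);	/* folded: identity cast */
	pmd_populate_kernel(mm, pmdp, ptep);	/* actually writes the pgd entry */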

>>
>> Current situation for 8xx is illustrated here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>>
>> I also tried to better illustrate e500/32 here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>>
>> For 64 bits:
>> We have PTE/PMD/PUD/PGD, no P4D
>>
>> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h
> 
> We don't have anything above pud in this category, right?  That's what I
> read from your wiki (and thanks for providing that in the first place; it
> helps a lot for me to understand how it works on PowerPC).

Yes, thanks to Michael and Aneesh, who initiated that wiki page.

> 
> I want to make sure I can move on without caring about p4d/pgd leaves,
> like what we do right now, even after we remove hugepd for good.  In that
> case, since p4d is always missing, it's about whether "pud|pmd|pte_leaf()"
> can also cover the pgd ones when that day comes, iiuc.

I guess so, but I'd like Aneesh and/or Michael to confirm, as I'm not an
expert on PPC64.

Christophe