mbox series

[00/13] mm/treewide: Remove pXd_huge() API

Message ID 20240313214719.253873-1-peterx@redhat.com (mailing list archive)
Headers show
Series mm/treewide: Remove pXd_huge() API | expand

Message

Peter Xu March 13, 2024, 9:47 p.m. UTC
From: Peter Xu <peterx@redhat.com>

[based on akpm/mm-unstable latest commit 9af2e4c429b5]

v1:
- Rebase, remove RFC tag
- Fixed powerpc patch build issue, enhancing commit message [Michael]
- Optimize patch 1 & 3 on "none || !present" check [Jason]

In previous work [1], we removed the pXd_large() API, which is arch
specific.  This patchset further removes the hugetlb pXd_huge() API.

Hugetlb was never special on creating huge mappings when compared with
other huge mappings.  Having a standalone API just to detect such pgtable
entries is more or less redundant, especially after the pXd_leaf() API set
is introduced with/without CONFIG_HUGETLB_PAGE.

When looking at this problem, a few issues are also exposed that we don't
have a clear definition of the *_huge() variance API.  This patchset
started by cleaning these issues first, then replace all *_huge() users to
use *_leaf(), then drop all *_huge() code.

On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for
all the rest archs they're reported "false" instead.  This part is done in
patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll
leave that to hmm experts to decide.

Besides, there are three archs (arm, arm64, powerpc) that have slightly
different definitions between the *_huge() v.s. *_leaf() variances.  I
tackled them separately so that it'll be easier for arch experts to chim in
when necessary.  This part is done in patch 6-9.

The final patches 10-13 do the rest on the final removal, since *_leaf()
will be the ultimate API in the future, and we seem to have quite some
confusions on how *_huge() APIs can be defined, provide a rich comment for
*_leaf() API set to define them properly to avoid future misuse, and
hopefully that'll also help new archs to start support huge mappings and
avoid traps (like either swap entries, or PROT_NONE entry checks).

The whole series is only lightly tested on x86, while as usual I don't have
the capability to test all archs that it touches.

[1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com

Peter Xu (13):
  mm/hmm: Process pud swap entry without pud_huge()
  mm/gup: Cache p4d in follow_p4d_mask()
  mm/gup: Check p4d presence before going on
  mm/x86: Change pXd_huge() behavior to exclude swap entries
  mm/sparc: Change pXd_huge() behavior to exclude swap entries
  mm/arm: Use macros to define pmd/pud helpers
  mm/arm: Redefine pmd_huge() with pmd_leaf()
  mm/arm64: Merge pXd_huge() and pXd_leaf() definitions
  mm/powerpc: Redefine pXd_huge() with pXd_leaf()
  mm/gup: Merge pXd huge mapping checks
  mm/treewide: Replace pXd_huge() with pXd_leaf()
  mm/treewide: Remove pXd_huge()
  mm: Document pXd_leaf() API

 arch/arm/include/asm/pgtable-2level.h         |  4 +--
 arch/arm/include/asm/pgtable-3level-hwdef.h   |  1 +
 arch/arm/include/asm/pgtable-3level.h         |  6 ++--
 arch/arm/mm/Makefile                          |  1 -
 arch/arm/mm/hugetlbpage.c                     | 34 -------------------
 arch/arm64/include/asm/pgtable.h              |  6 +++-
 arch/arm64/mm/hugetlbpage.c                   | 18 ++--------
 arch/loongarch/mm/hugetlbpage.c               | 12 +------
 arch/mips/include/asm/pgtable-32.h            |  2 +-
 arch/mips/include/asm/pgtable-64.h            |  2 +-
 arch/mips/mm/hugetlbpage.c                    | 10 ------
 arch/mips/mm/tlb-r4k.c                        |  2 +-
 arch/parisc/mm/hugetlbpage.c                  | 11 ------
 .../include/asm/book3s/64/pgtable-4k.h        | 20 -----------
 .../include/asm/book3s/64/pgtable-64k.h       | 25 --------------
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 27 +++++++--------
 arch/powerpc/include/asm/nohash/pgtable.h     | 10 ------
 arch/powerpc/mm/pgtable_64.c                  |  6 ++--
 arch/riscv/mm/hugetlbpage.c                   | 10 ------
 arch/s390/mm/hugetlbpage.c                    | 10 ------
 arch/sh/mm/hugetlbpage.c                      | 10 ------
 arch/sparc/mm/hugetlbpage.c                   | 12 -------
 arch/x86/mm/hugetlbpage.c                     | 26 --------------
 arch/x86/mm/pgtable.c                         |  4 +--
 include/linux/hugetlb.h                       | 24 -------------
 include/linux/pgtable.h                       | 24 ++++++++++---
 mm/gup.c                                      | 24 ++++++-------
 mm/hmm.c                                      |  9 ++---
 mm/memory.c                                   |  2 +-
 29 files changed, 68 insertions(+), 284 deletions(-)
 delete mode 100644 arch/arm/mm/hugetlbpage.c

Comments

Christophe Leroy March 14, 2024, 8:45 a.m. UTC | #1
Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> From: Peter Xu <peterx@redhat.com>
> 
> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> it will keep returning false.
> 
> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> mappings.
> 
> The goal should be that we will have one API pXd_leaf() to detect all kinds
> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.

All kinds of huge mappings ?

pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are 
also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages 
and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report 
those huge pages.

> 
> This helps to simplify a follow up patch to drop pXd_huge() treewide.
> 
> NOTE: *_leaf() definition need to be moved before the inclusion of
> asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.
> 
> [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse
> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   .../include/asm/book3s/64/pgtable-4k.h        | 14 ++--------
>   arch/powerpc/include/asm/book3s/64/pgtable.h  | 27 +++++++++----------
>   2 files changed, 14 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> index 48f21820afe2..92545981bb49 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> @@ -8,22 +8,12 @@
>   #ifdef CONFIG_HUGETLB_PAGE
>   static inline int pmd_huge(pmd_t pmd)
>   {
> -	/*
> -	 * leaf pte for huge page
> -	 */
> -	if (radix_enabled())
> -		return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -	return 0;
> +	return pmd_leaf(pmd);
>   }
>   
>   static inline int pud_huge(pud_t pud)
>   {
> -	/*
> -	 * leaf pte for huge page
> -	 */
> -	if (radix_enabled())
> -		return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -	return 0;
> +	return pud_leaf(pud);
>   }
>   
>   /*
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index df66dce8306f..fd7180fded75 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end;
>   
>   extern struct page *vmemmap;
>   extern unsigned long pci_io_base;
> +
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
> +{
> +	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> +}
> +
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
> +{
> +	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> +}
>   #endif /* __ASSEMBLY__ */
>   
>   #include <asm/book3s/64/hash.h>
> @@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va
>   	return false;
>   }
>   
> -/*
> - * Like pmd_huge(), but works regardless of config options
> - */
> -#define pmd_leaf pmd_leaf
> -static inline bool pmd_leaf(pmd_t pmd)
> -{
> -	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -#define pud_leaf pud_leaf
> -static inline bool pud_leaf(pud_t pud)
> -{
> -	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
> -
>   #endif /* __ASSEMBLY__ */
>   #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
Christophe Leroy March 14, 2024, 8:50 a.m. UTC | #2
Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> From: Peter Xu <peterx@redhat.com>
> 
> Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> reuse it.  Luckily, pXd_huge() isn't widely used.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   arch/arm/include/asm/pgtable-3level.h | 2 +-
>   arch/arm64/include/asm/pgtable.h      | 2 +-
>   arch/arm64/mm/hugetlbpage.c           | 4 ++--
>   arch/loongarch/mm/hugetlbpage.c       | 2 +-
>   arch/mips/mm/tlb-r4k.c                | 2 +-
>   arch/powerpc/mm/pgtable_64.c          | 6 +++---
>   arch/x86/mm/pgtable.c                 | 4 ++--
>   mm/gup.c                              | 4 ++--
>   mm/hmm.c                              | 2 +-
>   mm/memory.c                           | 2 +-
>   10 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> index e7aecbef75c9..9e3c44f0aea2 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
>   #define pmd_dirty(pmd)		(pmd_isset((pmd), L_PMD_SECT_DIRTY))
>   
>   #define pmd_hugewillfault(pmd)	(!pmd_young(pmd) || !pmd_write(pmd))
> -#define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))
> +#define pmd_thp_or_huge(pmd)	(pmd_leaf(pmd) || pmd_trans_huge(pmd))

Previous patch said pmd_trans_huge() implies pmd_leaf().

Or is that only for GUP ?

>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !pmd_table(pmd))


> diff --git a/mm/hmm.c b/mm/hmm.c
> index c95b9ec5d95f..93aebd9cc130 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
>   		return hmm_vma_walk_hole(start, end, -1, walk);
>   	}
>   
> -	if (pud_huge(pud) && pud_devmap(pud)) {
> +	if (pud_leaf(pud) && pud_devmap(pud)) {

Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

>   		unsigned long i, npages, pfn;
>   		unsigned int required_fault;
>   		unsigned long *hmm_pfns;
Christophe Leroy March 14, 2024, 8:56 a.m. UTC | #3
Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> From: Peter Xu <peterx@redhat.com>
> 
> This API is not used anymore, drop it for the whole tree.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   arch/arm/mm/Makefile                          |  1 -
>   arch/arm/mm/hugetlbpage.c                     | 29 -------------------
>   arch/arm64/mm/hugetlbpage.c                   | 10 -------
>   arch/loongarch/mm/hugetlbpage.c               | 10 -------
>   arch/mips/include/asm/pgtable-32.h            |  2 +-
>   arch/mips/include/asm/pgtable-64.h            |  2 +-
>   arch/mips/mm/hugetlbpage.c                    | 10 -------
>   arch/parisc/mm/hugetlbpage.c                  | 11 -------
>   .../include/asm/book3s/64/pgtable-4k.h        | 10 -------
>   .../include/asm/book3s/64/pgtable-64k.h       | 25 ----------------
>   arch/powerpc/include/asm/nohash/pgtable.h     | 10 -------
>   arch/riscv/mm/hugetlbpage.c                   | 10 -------
>   arch/s390/mm/hugetlbpage.c                    | 10 -------
>   arch/sh/mm/hugetlbpage.c                      | 10 -------
>   arch/sparc/mm/hugetlbpage.c                   | 10 -------
>   arch/x86/mm/hugetlbpage.c                     | 16 ----------
>   include/linux/hugetlb.h                       | 24 ---------------
>   17 files changed, 2 insertions(+), 198 deletions(-)
>   delete mode 100644 arch/arm/mm/hugetlbpage.c
> 

> diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
> index 0e196650f4f4..92b7591aac2a 100644
> --- a/arch/mips/include/asm/pgtable-32.h
> +++ b/arch/mips/include/asm/pgtable-32.h
> @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
>   static inline int pmd_bad(pmd_t pmd)
>   {
>   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> -	/* pmd_huge(pmd) but inline */
> +	/* pmd_leaf(pmd) but inline */

Shouldn't this comment have been changed in patch 11 ?

>   	if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so 
it could be used here instead of open coping.

>   		return 0;
>   #endif
> diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
> index 20ca48c1b606..7c28510b3768 100644
> --- a/arch/mips/include/asm/pgtable-64.h
> +++ b/arch/mips/include/asm/pgtable-64.h
> @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
>   static inline int pmd_bad(pmd_t pmd)
>   {
>   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> -	/* pmd_huge(pmd) but inline */
> +	/* pmd_leaf(pmd) but inline */

Same

>   	if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Same

>   		return 0;
>   #endif

> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> index 2fce3498b000..579a7153857f 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> @@ -4,31 +4,6 @@
>   
>   #ifndef __ASSEMBLY__
>   #ifdef CONFIG_HUGETLB_PAGE
> -/*
> - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> - *
> - * Defined in such a way that we can optimize away code block at build time
> - * if CONFIG_HUGETLB_PAGE=n.
> - *
> - * returns true for pmd migration entries, THP, devmap, hugetlb
> - * But compile time dependent on CONFIG_HUGETLB_PAGE
> - */

Should we keep this comment somewhere for documentation ?

> -static inline int pmd_huge(pmd_t pmd)
> -{
> -	/*
> -	 * leaf pte for huge page
> -	 */
> -	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -static inline int pud_huge(pud_t pud)
> -{
> -	/*
> -	 * leaf pte for huge page
> -	 */
> -	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
>   
>   /*
>    * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't
Peter Xu March 14, 2024, 12:59 p.m. UTC | #4
On Thu, Mar 14, 2024 at 08:50:20AM +0000, Christophe Leroy wrote:
> 
> 
> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> > From: Peter Xu <peterx@redhat.com>
> > 
> > Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> > reuse it.  Luckily, pXd_huge() isn't widely used.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   arch/arm/include/asm/pgtable-3level.h | 2 +-
> >   arch/arm64/include/asm/pgtable.h      | 2 +-
> >   arch/arm64/mm/hugetlbpage.c           | 4 ++--
> >   arch/loongarch/mm/hugetlbpage.c       | 2 +-
> >   arch/mips/mm/tlb-r4k.c                | 2 +-
> >   arch/powerpc/mm/pgtable_64.c          | 6 +++---
> >   arch/x86/mm/pgtable.c                 | 4 ++--
> >   mm/gup.c                              | 4 ++--
> >   mm/hmm.c                              | 2 +-
> >   mm/memory.c                           | 2 +-
> >   10 files changed, 15 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > index e7aecbef75c9..9e3c44f0aea2 100644
> > --- a/arch/arm/include/asm/pgtable-3level.h
> > +++ b/arch/arm/include/asm/pgtable-3level.h
> > @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
> >   #define pmd_dirty(pmd)		(pmd_isset((pmd), L_PMD_SECT_DIRTY))
> >   
> >   #define pmd_hugewillfault(pmd)	(!pmd_young(pmd) || !pmd_write(pmd))
> > -#define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))
> > +#define pmd_thp_or_huge(pmd)	(pmd_leaf(pmd) || pmd_trans_huge(pmd))
> 
> Previous patch said pmd_trans_huge() implies pmd_leaf().

Ah here I remember I kept this arm definition there because I think we
should add a patch to drop pmd_thp_or_huge() completely.  If you won't mind
I can add one more patch instead of doing it here.  Then I keep this patch
purely as a replacement patch without further changes on arch-cleanups.

> 
> Or is that only for GUP ?

I think it should apply to all.

> 
> >   
> >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !pmd_table(pmd))
> 
> 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c95b9ec5d95f..93aebd9cc130 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> >   		return hmm_vma_walk_hole(start, end, -1, walk);
> >   	}
> >   
> > -	if (pud_huge(pud) && pud_devmap(pud)) {
> > +	if (pud_leaf(pud) && pud_devmap(pud)) {
> 
> Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

This is an extra safety check that I didn't remove.  Devmap used separate
bits even though I'm not clear on why.  It should still imply a leaf though.

Thanks,

> 
> >   		unsigned long i, npages, pfn;
> >   		unsigned int required_fault;
> >   		unsigned long *hmm_pfns;
> 
>
Christophe Leroy March 14, 2024, 1:11 p.m. UTC | #5
Le 14/03/2024 à 13:53, Peter Xu a écrit :
> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>
>>
>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
>>> From: Peter Xu <peterx@redhat.com>
>>>
>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>> it will keep returning false.
>>>
>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>> mappings.
>>>
>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>
>> All kinds of huge mappings ?
>>
>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>> those huge pages.
> 
> Ah yes, I should always mention this is in the context of leaf huge pages
> only.  Are the examples you provided all fall into hugepd category?  If so
> I can reword the commit message, as:

On powerpc 8xx, only the 8M huge pages fall into the hugepd case.

The 512k hugepages are at PTE level, they are handled more or less like 
CONT_PTE on ARM. see function set_huge_pte_at() for more context.

You can also look at pte_leaf_size() and pgd_leaf_size().

By the way pgd_leaf_size() looks odd because it is called only when 
pgd_leaf_size() returns true, which never happens for 8M pages.

> 
>          As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
>          create such huge mappings for 4K hash MMUs.  Meanwhile, the major
>          powerpc hugetlb pgtable walker __find_linux_pte() already used
>          pXd_leaf() to check leaf hugetlb mappings.
> 
>          The goal should be that we will have one API pXd_leaf() to detect
>          all kinds of huge mappings except hugepd.  AFAICT we need to use
>          the pXd_leaf() impl (rather than pXd_huge() ones) to make sure
>          ie. THPs on hash MMU will also return true.
> 
> Does this look good to you?
> 
> Thanks,
>
Jason Gunthorpe March 18, 2024, 4:15 p.m. UTC | #6
On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
> 
> 
> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> > On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
> >>
> >>
> >> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> >>> From: Peter Xu <peterx@redhat.com>
> >>>
> >>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> >>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> >>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> >>> it will keep returning false.
> >>>
> >>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> >>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> >>> mappings.
> >>>
> >>> The goal should be that we will have one API pXd_leaf() to detect all kinds
> >>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> >>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >>
> >> All kinds of huge mappings ?
> >>
> >> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >> those huge pages.
> > 
> > Ah yes, I should always mention this is in the context of leaf huge pages
> > only.  Are the examples you provided all fall into hugepd category?  If so
> > I can reword the commit message, as:
> 
> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
> 
> The 512k hugepages are at PTE level, they are handled more or less like 
> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
> 
> You can also look at pte_leaf_size() and pgd_leaf_size().

IMHO leaf should return false if the thing is pointing to a next level
page table, even if that next level is fully populated with contiguous
pages.

This seems more aligned with the contig page direction that hugepd
should be moved over to..

> By the way pgd_leaf_size() looks odd because it is called only when 
> pgd_leaf_size() returns true, which never happens for 8M pages.

Like this, you should reach the actual final leaf that the HW will
load and leaf_size() should say it is greater size than the current
table level. Other levels should return 0.

If necessary the core MM code should deal with this by iterating over
adjacent tables.

Jason
Jason Gunthorpe March 18, 2024, 4:16 p.m. UTC | #7
On Thu, Mar 14, 2024 at 08:59:22AM -0400, Peter Xu wrote:

> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> > >   		return hmm_vma_walk_hole(start, end, -1, walk);
> > >   	}
> > >   
> > > -	if (pud_huge(pud) && pud_devmap(pud)) {
> > > +	if (pud_leaf(pud) && pud_devmap(pud)) {
> > 
> > Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?
> 
> This is an extra safety check that I didn't remove.  Devmap used separate
> bits even though I'm not clear on why.  It should still imply a leaf though.

Yes, something is very wrong if devmap is true on non-leaf..

Jason
Christophe Leroy March 19, 2024, 11:07 p.m. UTC | #8
Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
>>>>> From: Peter Xu <peterx@redhat.com>
>>>>>
>>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>>>> it will keep returning false.
>>>>>
>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>>>> mappings.
>>>>>
>>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>>>
>>>> All kinds of huge mappings ?
>>>>
>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>> those huge pages.
>>>
>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>> only.  Are the examples you provided all fall into hugepd category?  If so
>>> I can reword the commit message, as:
>>
>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>
>> The 512k hugepages are at PTE level, they are handled more or less like
>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>
>> You can also look at pte_leaf_size() and pgd_leaf_size().
> 
> IMHO leaf should return false if the thing is pointing to a next level
> page table, even if that next level is fully populated with contiguous
> pages.
> 
> This seems more aligned with the contig page direction that hugepd
> should be moved over to..

Should hugepd be moved to the contig page direction, really ?

Would it be acceptable that a 8M hugepage requires 2048 contig entries 
in 2 page tables, when the hugepd allows a single entry ? Would it be 
acceptable performancewise ?

> 
>> By the way pgd_leaf_size() looks odd because it is called only when
>> pgd_leaf_size() returns true, which never happens for 8M pages.
> 
> Like this, you should reach the actual final leaf that the HW will
> load and leaf_size() should say it is greater size than the current
> table level. Other levels should return 0.
> 
> If necessary the core MM code should deal with this by iterating over
> adjacent tables.
> 
> Jason
Jason Gunthorpe March 19, 2024, 11:26 p.m. UTC | #9
On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote:
> 
> 
> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> > On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
> >>
> >>
> >> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> >>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
> >>>>
> >>>>
> >>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
> >>>>> From: Peter Xu <peterx@redhat.com>
> >>>>>
> >>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> >>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> >>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> >>>>> it will keep returning false.
> >>>>>
> >>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> >>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> >>>>> mappings.
> >>>>>
> >>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
> >>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> >>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >>>>
> >>>> All kinds of huge mappings ?
> >>>>
> >>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >>>> those huge pages.
> >>>
> >>> Ah yes, I should always mention this is in the context of leaf huge pages
> >>> only.  Are the examples you provided all fall into hugepd category?  If so
> >>> I can reword the commit message, as:
> >>
> >> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
> >>
> >> The 512k hugepages are at PTE level, they are handled more or less like
> >> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
> >>
> >> You can also look at pte_leaf_size() and pgd_leaf_size().
> > 
> > IMHO leaf should return false if the thing is pointing to a next level
> > page table, even if that next level is fully populated with contiguous
> > pages.
> > 
> > This seems more aligned with the contig page direction that hugepd
> > should be moved over to..
> 
> Should hugepd be moved to the contig page direction, really ?

Sure? Is there any downside for the reading side to do so?

> Would it be acceptable that a 8M hugepage requires 2048 contig entries
> in 2 page tables, when the hugepd allows a single entry ? 

? I thought we agreed the only difference would be that something new
is needed to merge the two identical sibling page tables into one, ie
you pay 2x the page table memory if that isn't fixed. That is write
side only change and I imagine it could be done with a single PPC
special API.

Honestly not totally sure that is a big deal, it is already really
memory inefficient compared to every other arch's huge page by needing
the child page table in the first place.

> Would it be acceptable performancewise ?

Isn't this particular PPC sub platform ancient? Are there current real
users that are going to have hugetlbfs special code and care about
this performance detail on a 6.20 era kernel?

In today's world wouldn't it be performance better if these platforms
could support THP by aligning to the contig API instead of being
special?

Am I wrong to question why we are polluting the core code for this
special optimization?

Jason
Christophe Leroy March 20, 2024, 6:16 a.m. UTC | #10
Le 20/03/2024 à 00:26, Jason Gunthorpe a écrit :
> On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
>>> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>>>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit :
>>>>>>> From: Peter Xu <peterx@redhat.com>
>>>>>>>
>>>>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>>>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
>>>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>>>>>> it will keep returning false.
>>>>>>>
>>>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
>>>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>>>>>> mappings.
>>>>>>>
>>>>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>>>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
>>>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>>>>>
>>>>>> All kinds of huge mappings ?
>>>>>>
>>>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>>>> those huge pages.
>>>>>
>>>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>>>> only.  Are the examples you provided all fall into hugepd category?  If so
>>>>> I can reword the commit message, as:
>>>>
>>>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>>>
>>>> The 512k hugepages are at PTE level, they are handled more or less like
>>>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>>>
>>>> You can also look at pte_leaf_size() and pgd_leaf_size().
>>>
>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really ?
> 
> Sure? Is there any downside for the reading side to do so?

Probably not.

> 
>> Would it be acceptable that a 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry ?
> 
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is write
> side only change and I imagine it could be done with a single PPC
> special API.
> 
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
> 
>> Would it be acceptable performancewise ?
> 
> Isn't this particular PPC sub platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes but still widely in use and with the emergence of voice over 
IP in Air Trafic Control, performance becomes more and more challenge 
with those old boards that have another 10 years in front of them.

> 
> In today's world wouldn't it be performance better if these platforms
> could support THP by aligning to the contig API instead of being
> special?

Indeed, if we can promote THP that'd be even better.

> 
> Am I wrong to question why we are polluting the core code for this
> special optimization?

At the first place that was to get a close fit between hardware 
pagetable topology and linux pagetable topology. But obviously we 
already stepped back for 512k pages, so let's go one more step aside and 
do similar with 8M pages.

I'll give it a try and see how it goes.

Christophe
Christophe Leroy March 20, 2024, 5:40 p.m. UTC | #11
Le 20/03/2024 à 17:09, Peter Xu a écrit :
> On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
> 
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

> 
> There seem to be other PowerPC systems use hugepd.  Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, may Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled 
by software. Even the e6500 which has a hardware tablewalk falls back on 
software walk when it is a hugepage IIUC.

Christophe
Peter Xu March 20, 2024, 8:24 p.m. UTC | #12
On Wed, Mar 20, 2024 at 05:40:39PM +0000, Christophe Leroy wrote:
> 
> 
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> > 
> > So you're talking about 8M only for 8xx, am I right?
> 
> Yes I am.
> 
> > 
> > There seem to be other PowerPC systems use hugepd.  Is it possible that we
> > convert all hugepd into cont_pte form?
> 
> Indeed.
> 
> Seems like we have hugepd for book3s/64 and for nohash.
> 
> For book3s I don't know, may Aneesh can answer.
> 
> For nohash I think it should be possible because TLB misses are handled 
> by software. Even the e6500 which has a hardware tablewalk falls back on 
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answer here, and then I know the path for
hugepd in general.  I don't want to add any new code into core mm to
something destined to fade away soon.

One option for me is I can check a macro of hugepd existance, so all new
code will only work when hugepd is not supported on such arch.  However
that'll start to make some PowerPC systems special (which I still tried
hard to avoid, if that wasn't proved in the past..), meanwhile we'll also
need to keep some generic-mm paths (that I can already remove along with
the new code) only for these hugepd systems.  But it's still okay to me,
it'll be just a matter of when to drop those codes, sooner or later.

Thanks,