Message ID: 20240313214719.253873-1-peterx@redhat.com (mailing list archive)
Series: mm/treewide: Remove pXd_huge() API
Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : > From: Peter Xu <peterx@redhat.com> > > PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() > constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], > it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so > it will keep returning false. > > As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create > such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb > pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb > mappings. > > The goal should be that we will have one API pXd_leaf() to detect all kinds > of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than > pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. All kinds of huge mappings ? pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report those huge pages. > > This helps to simplify a follow up patch to drop pXd_huge() treewide. > > NOTE: *_leaf() definition need to be moved before the inclusion of > asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it. > > [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse > > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Nicholas Piggin <npiggin@gmail.com> > Cc: Christophe Leroy <christophe.leroy@csgroup.eu> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> > Cc: "Naveen N. 
Rao" <naveen.n.rao@linux.ibm.com> > Cc: linuxppc-dev@lists.ozlabs.org > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > .../include/asm/book3s/64/pgtable-4k.h | 14 ++-------- > arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +++++++++---------- > 2 files changed, 14 insertions(+), 27 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h > index 48f21820afe2..92545981bb49 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h > @@ -8,22 +8,12 @@ > #ifdef CONFIG_HUGETLB_PAGE > static inline int pmd_huge(pmd_t pmd) > { > - /* > - * leaf pte for huge page > - */ > - if (radix_enabled()) > - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); > - return 0; > + return pmd_leaf(pmd); > } > > static inline int pud_huge(pud_t pud) > { > - /* > - * leaf pte for huge page > - */ > - if (radix_enabled()) > - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); > - return 0; > + return pud_leaf(pud); > } > > /* > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h > index df66dce8306f..fd7180fded75 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end; > > extern struct page *vmemmap; > extern unsigned long pci_io_base; > + > +#define pmd_leaf pmd_leaf > +static inline bool pmd_leaf(pmd_t pmd) > +{ > + return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); > +} > + > +#define pud_leaf pud_leaf > +static inline bool pud_leaf(pud_t pud) > +{ > + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); > +} > #endif /* __ASSEMBLY__ */ > > #include <asm/book3s/64/hash.h> > @@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va > return false; > } > > -/* > - * Like pmd_huge(), but works regardless of config options > - */ > -#define 
pmd_leaf pmd_leaf > -static inline bool pmd_leaf(pmd_t pmd) > -{ > - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); > -} > - > -#define pud_leaf pud_leaf > -static inline bool pud_leaf(pud_t pud) > -{ > - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); > -} > - > #endif /* __ASSEMBLY__ */ > #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : > From: Peter Xu <peterx@redhat.com> > > Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), > reuse it. Luckily, pXd_huge() isn't widely used. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > arch/arm/include/asm/pgtable-3level.h | 2 +- > arch/arm64/include/asm/pgtable.h | 2 +- > arch/arm64/mm/hugetlbpage.c | 4 ++-- > arch/loongarch/mm/hugetlbpage.c | 2 +- > arch/mips/mm/tlb-r4k.c | 2 +- > arch/powerpc/mm/pgtable_64.c | 6 +++--- > arch/x86/mm/pgtable.c | 4 ++-- > mm/gup.c | 4 ++-- > mm/hmm.c | 2 +- > mm/memory.c | 2 +- > 10 files changed, 15 insertions(+), 15 deletions(-) > > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h > index e7aecbef75c9..9e3c44f0aea2 100644 > --- a/arch/arm/include/asm/pgtable-3level.h > +++ b/arch/arm/include/asm/pgtable-3level.h > @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte) > #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY)) > > #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd)) > -#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd)) > +#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd)) Previous patch said pmd_trans_huge() implies pmd_leaf(). Or is that only for GUP ? > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > #define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd)) > diff --git a/mm/hmm.c b/mm/hmm.c > index c95b9ec5d95f..93aebd9cc130 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, > return hmm_vma_walk_hole(start, end, -1, walk); > } > > - if (pud_huge(pud) && pud_devmap(pud)) { > + if (pud_leaf(pud) && pud_devmap(pud)) { Didn't previous patch say devmap implies leaf ? Or is it only for GUP ? > unsigned long i, npages, pfn; > unsigned int required_fault; > unsigned long *hmm_pfns;
On 13/03/2024 at 22:47, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> This API is not used anymore, drop it for the whole tree.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/arm/mm/Makefile | 1 -
> arch/arm/mm/hugetlbpage.c | 29 -------------------
> arch/arm64/mm/hugetlbpage.c | 10 -------
> arch/loongarch/mm/hugetlbpage.c | 10 -------
> arch/mips/include/asm/pgtable-32.h | 2 +-
> arch/mips/include/asm/pgtable-64.h | 2 +-
> arch/mips/mm/hugetlbpage.c | 10 -------
> arch/parisc/mm/hugetlbpage.c | 11 -------
> .../include/asm/book3s/64/pgtable-4k.h | 10 -------
> .../include/asm/book3s/64/pgtable-64k.h | 25 ----------------
> arch/powerpc/include/asm/nohash/pgtable.h | 10 -------
> arch/riscv/mm/hugetlbpage.c | 10 -------
> arch/s390/mm/hugetlbpage.c | 10 -------
> arch/sh/mm/hugetlbpage.c | 10 -------
> arch/sparc/mm/hugetlbpage.c | 10 -------
> arch/x86/mm/hugetlbpage.c | 16 ----------
> include/linux/hugetlb.h | 24 ---------------
> 17 files changed, 2 insertions(+), 198 deletions(-)
> delete mode 100644 arch/arm/mm/hugetlbpage.c
>
> diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
> index 0e196650f4f4..92b7591aac2a 100644
> --- a/arch/mips/include/asm/pgtable-32.h
> +++ b/arch/mips/include/asm/pgtable-32.h
> @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
> static inline int pmd_bad(pmd_t pmd)
> {
> #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Shouldn't this comment have been changed in patch 11?

> if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so it could be used here instead of open coding.
> return 0;
> #endif
> diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
> index 20ca48c1b606..7c28510b3768 100644
> --- a/arch/mips/include/asm/pgtable-64.h
> +++ b/arch/mips/include/asm/pgtable-64.h
> @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
> static inline int pmd_bad(pmd_t pmd)
> {
> #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Same

> if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Same

> return 0;
> #endif
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> index 2fce3498b000..579a7153857f 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> @@ -4,31 +4,6 @@
>
> #ifndef __ASSEMBLY__
> #ifdef CONFIG_HUGETLB_PAGE
> -/*
> - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> - *
> - * Defined in such a way that we can optimize away code block at build time
> - * if CONFIG_HUGETLB_PAGE=n.
> - *
> - * returns true for pmd migration entries, THP, devmap, hugetlb
> - * But compile time dependent on CONFIG_HUGETLB_PAGE
> - */

Should we keep this comment somewhere for documentation?

> -static inline int pmd_huge(pmd_t pmd)
> -{
> - /*
> - * leaf pte for huge page
> - */
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -static inline int pud_huge(pud_t pud)
> -{
> - /*
> - * leaf pte for huge page
> - */
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
>
> /*
> * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't
On Thu, Mar 14, 2024 at 08:50:20AM +0000, Christophe Leroy wrote: > > > Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : > > From: Peter Xu <peterx@redhat.com> > > > > Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), > > reuse it. Luckily, pXd_huge() isn't widely used. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > arch/arm/include/asm/pgtable-3level.h | 2 +- > > arch/arm64/include/asm/pgtable.h | 2 +- > > arch/arm64/mm/hugetlbpage.c | 4 ++-- > > arch/loongarch/mm/hugetlbpage.c | 2 +- > > arch/mips/mm/tlb-r4k.c | 2 +- > > arch/powerpc/mm/pgtable_64.c | 6 +++--- > > arch/x86/mm/pgtable.c | 4 ++-- > > mm/gup.c | 4 ++-- > > mm/hmm.c | 2 +- > > mm/memory.c | 2 +- > > 10 files changed, 15 insertions(+), 15 deletions(-) > > > > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h > > index e7aecbef75c9..9e3c44f0aea2 100644 > > --- a/arch/arm/include/asm/pgtable-3level.h > > +++ b/arch/arm/include/asm/pgtable-3level.h > > @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte) > > #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY)) > > > > #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd)) > > -#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd)) > > +#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd)) > > Previous patch said pmd_trans_huge() implies pmd_leaf(). Ah here I remember I kept this arm definition there because I think we should add a patch to drop pmd_thp_or_huge() completely. If you won't mind I can add one more patch instead of doing it here. Then I keep this patch purely as a replacement patch without further changes on arch-cleanups. > > Or is that only for GUP ? I think it should apply to all. 
> > > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > #define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd)) > > > > diff --git a/mm/hmm.c b/mm/hmm.c > > index c95b9ec5d95f..93aebd9cc130 100644 > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, > > return hmm_vma_walk_hole(start, end, -1, walk); > > } > > > > - if (pud_huge(pud) && pud_devmap(pud)) { > > + if (pud_leaf(pud) && pud_devmap(pud)) { > > Didn't previous patch say devmap implies leaf ? Or is it only for GUP ? This is an extra safety check that I didn't remove. Devmap used separate bits even though I'm not clear on why. It should still imply a leaf though. Thanks, > > > unsigned long i, npages, pfn; > > unsigned int required_fault; > > unsigned long *hmm_pfns; > >
Le 14/03/2024 à 13:53, Peter Xu a écrit : > On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote: >> >> >> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : >>> From: Peter Xu <peterx@redhat.com> >>> >>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() >>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], >>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so >>> it will keep returning false. >>> >>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create >>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb >>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb >>> mappings. >>> >>> The goal should be that we will have one API pXd_leaf() to detect all kinds >>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than >>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. >> >> All kinds of huge mappings ? >> >> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are >> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages >> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report >> those huge pages. > > Ah yes, I should always mention this is in the context of leaf huge pages > only. Are the examples you provided all fall into hugepd category? If so > I can reword the commit message, as: On powerpc 8xx, only the 8M huge pages fall into the hugepd case. The 512k hugepages are at PTE level, they are handled more or less like CONT_PTE on ARM. see function set_huge_pte_at() for more context. You can also look at pte_leaf_size() and pgd_leaf_size(). By the way pgd_leaf_size() looks odd because it is called only when pgd_leaf_size() returns true, which never happens for 8M pages. > > As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to > create such huge mappings for 4K hash MMUs. 
Meanwhile, the major > powerpc hugetlb pgtable walker __find_linux_pte() already used > pXd_leaf() to check leaf hugetlb mappings. > > The goal should be that we will have one API pXd_leaf() to detect > all kinds of huge mappings except hugepd. AFAICT we need to use > the pXd_leaf() impl (rather than pXd_huge() ones) to make sure > ie. THPs on hash MMU will also return true. > > Does this look good to you? > > Thanks, >
On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote: > > > Le 14/03/2024 à 13:53, Peter Xu a écrit : > > On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote: > >> > >> > >> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : > >>> From: Peter Xu <peterx@redhat.com> > >>> > >>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() > >>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], > >>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so > >>> it will keep returning false. > >>> > >>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create > >>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb > >>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb > >>> mappings. > >>> > >>> The goal should be that we will have one API pXd_leaf() to detect all kinds > >>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than > >>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. > >> > >> All kinds of huge mappings ? > >> > >> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are > >> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages > >> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report > >> those huge pages. > > > > Ah yes, I should always mention this is in the context of leaf huge pages > > only. Are the examples you provided all fall into hugepd category? If so > > I can reword the commit message, as: > > On powerpc 8xx, only the 8M huge pages fall into the hugepd case. > > The 512k hugepages are at PTE level, they are handled more or less like > CONT_PTE on ARM. see function set_huge_pte_at() for more context. > > You can also look at pte_leaf_size() and pgd_leaf_size(). 
IMHO leaf should return false if the thing is pointing to a next level page table, even if that next level is fully populated with contiguous pages.

This seems more aligned with the contig page direction that hugepd should be moved over to..

> By the way pgd_leaf_size() looks odd because it is called only when
> pgd_leaf_size() returns true, which never happens for 8M pages.

Like this, you should reach the actual final leaf that the HW will load and leaf_size() should say it is greater size than the current table level. Other levels should return 0.

If necessary the core MM code should deal with this by iterating over adjacent tables.

Jason
On Thu, Mar 14, 2024 at 08:59:22AM -0400, Peter Xu wrote:
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
> > > return hmm_vma_walk_hole(start, end, -1, walk);
> > > }
> > >
> > > - if (pud_huge(pud) && pud_devmap(pud)) {
> > > + if (pud_leaf(pud) && pud_devmap(pud)) {
> >
> > Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?
>
> This is an extra safety check that I didn't remove. Devmap used separate
> bits even though I'm not clear on why. It should still imply a leaf though.

Yes, something is very wrong if devmap is true on non-leaf..

Jason
Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit : > On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote: >> >> >> Le 14/03/2024 à 13:53, Peter Xu a écrit : >>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote: >>>> >>>> >>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : >>>>> From: Peter Xu <peterx@redhat.com> >>>>> >>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() >>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], >>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so >>>>> it will keep returning false. >>>>> >>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create >>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb >>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb >>>>> mappings. >>>>> >>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds >>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than >>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. >>>> >>>> All kinds of huge mappings ? >>>> >>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are >>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages >>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report >>>> those huge pages. >>> >>> Ah yes, I should always mention this is in the context of leaf huge pages >>> only. Are the examples you provided all fall into hugepd category? If so >>> I can reword the commit message, as: >> >> On powerpc 8xx, only the 8M huge pages fall into the hugepd case. >> >> The 512k hugepages are at PTE level, they are handled more or less like >> CONT_PTE on ARM. see function set_huge_pte_at() for more context. >> >> You can also look at pte_leaf_size() and pgd_leaf_size(). 
> > IMHO leaf should return false if the thing is pointing to a next level > page table, even if that next level is fully populated with contiguous > pages. > > This seems more aligned with the contig page direction that hugepd > should be moved over to.. Should hugepd be moved to the contig page direction, really ? Would it be acceptable that a 8M hugepage requires 2048 contig entries in 2 page tables, when the hugepd allows a single entry ? Would it be acceptable performancewise ? > >> By the way pgd_leaf_size() looks odd because it is called only when >> pgd_leaf_size() returns true, which never happens for 8M pages. > > Like this, you should reach the actual final leaf that the HW will > load and leaf_size() should say it is greater size than the current > table level. Other levels should return 0. > > If necessary the core MM code should deal with this by iterating over > adjacent tables. > > Jason
On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote: > > > Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit : > > On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote: > >> > >> > >> Le 14/03/2024 à 13:53, Peter Xu a écrit : > >>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote: > >>>> > >>>> > >>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : > >>>>> From: Peter Xu <peterx@redhat.com> > >>>>> > >>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() > >>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], > >>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so > >>>>> it will keep returning false. > >>>>> > >>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create > >>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb > >>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb > >>>>> mappings. > >>>>> > >>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds > >>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than > >>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. > >>>> > >>>> All kinds of huge mappings ? > >>>> > >>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are > >>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages > >>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report > >>>> those huge pages. > >>> > >>> Ah yes, I should always mention this is in the context of leaf huge pages > >>> only. Are the examples you provided all fall into hugepd category? If so > >>> I can reword the commit message, as: > >> > >> On powerpc 8xx, only the 8M huge pages fall into the hugepd case. > >> > >> The 512k hugepages are at PTE level, they are handled more or less like > >> CONT_PTE on ARM. 
see function set_huge_pte_at() for more context. > >> > >> You can also look at pte_leaf_size() and pgd_leaf_size(). > > > > IMHO leaf should return false if the thing is pointing to a next level > > page table, even if that next level is fully populated with contiguous > > pages. > > > > This seems more aligned with the contig page direction that hugepd > > should be moved over to.. > > Should hugepd be moved to the contig page direction, really ? Sure? Is there any downside for the reading side to do so? > Would it be acceptable that a 8M hugepage requires 2048 contig entries > in 2 page tables, when the hugepd allows a single entry ? ? I thought we agreed the only difference would be that something new is needed to merge the two identical sibling page tables into one, ie you pay 2x the page table memory if that isn't fixed. That is write side only change and I imagine it could be done with a single PPC special API. Honestly not totally sure that is a big deal, it is already really memory inefficient compared to every other arch's huge page by needing the child page table in the first place. > Would it be acceptable performancewise ? Isn't this particular PPC sub platform ancient? Are there current real users that are going to have hugetlbfs special code and care about this performance detail on a 6.20 era kernel? In today's world wouldn't it be performance better if these platforms could support THP by aligning to the contig API instead of being special? Am I wrong to question why we are polluting the core code for this special optimization? Jason
Le 20/03/2024 à 00:26, Jason Gunthorpe a écrit : > On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote: >> >> >> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit : >>> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote: >>>> >>>> >>>> Le 14/03/2024 à 13:53, Peter Xu a écrit : >>>>> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote: >>>>>> >>>>>> >>>>>> Le 13/03/2024 à 22:47, peterx@redhat.com a écrit : >>>>>>> From: Peter Xu <peterx@redhat.com> >>>>>>> >>>>>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() >>>>>>> constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], >>>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so >>>>>>> it will keep returning false. >>>>>>> >>>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create >>>>>>> such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb >>>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb >>>>>>> mappings. >>>>>>> >>>>>>> The goal should be that we will have one API pXd_leaf() to detect all kinds >>>>>>> of huge mappings. AFAICT we need to use the pXd_leaf() impl (rather than >>>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true. >>>>>> >>>>>> All kinds of huge mappings ? >>>>>> >>>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are >>>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages >>>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report >>>>>> those huge pages. >>>>> >>>>> Ah yes, I should always mention this is in the context of leaf huge pages >>>>> only. Are the examples you provided all fall into hugepd category? If so >>>>> I can reword the commit message, as: >>>> >>>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case. 
>>>>
>>>> The 512k hugepages are at PTE level, they are handled more or less like
>>>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>>>
>>>> You can also look at pte_leaf_size() and pgd_leaf_size().
>>>
>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really ?
>
> Sure? Is there any downside for the reading side to do so?

Probably not.

>
>> Would it be acceptable that a 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry ?
>
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is write
> side only change and I imagine it could be done with a single PPC
> special API.
>
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
>
>> Would it be acceptable performancewise ?
>
> Isn't this particular PPC sub platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes, but still widely in use, and with the emergence of voice over IP in Air Traffic Control, performance becomes more and more of a challenge with those old boards that have another 10 years in front of them.

>
> In today's world wouldn't it be performance better if these platforms
> could support THP by aligning to the contig API instead of being
> special?

Indeed, if we can promote THP that'd be even better.
>
> Am I wrong to question why we are polluting the core code for this
> special optimization?

Originally that was to get a close fit between the hardware pagetable topology and the Linux pagetable topology. But obviously we already stepped back for 512k pages, so let's go one more step aside and do the same with 8M pages.

I'll give it a try and see how it goes.

Christophe
On 20/03/2024 at 17:09, Peter Xu wrote:
> On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
>
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

>
> There seem to be other PowerPC systems use hugepd. Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, maybe Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled by software. Even the e6500, which has a hardware tablewalk, falls back on software walk when it is a hugepage IIUC.

Christophe
On Wed, Mar 20, 2024 at 05:40:39PM +0000, Christophe Leroy wrote:
>
>
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> >
> > So you're talking about 8M only for 8xx, am I right?
>
> Yes I am.
>
> >
> > There seem to be other PowerPC systems use hugepd. Is it possible that we
> > convert all hugepd into cont_pte form?
>
> Indeed.
>
> Seems like we have hugepd for book3s/64 and for nohash.
>
> For book3s I don't know, may Aneesh can answer.
>
> For nohash I think it should be possible because TLB misses are handled
> by software. Even the e6500 which has a hardware tablewalk falls back on
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answer here, and then I know the path for hugepd in general. I don't want to add any new code into core mm for something destined to fade away soon.

One option for me is to check a macro of hugepd existence, so all new code will only work when hugepd is not supported on such an arch. However that'll start to make some PowerPC systems special (which I still tried hard to avoid, if that wasn't proved in the past..), meanwhile we'll also need to keep some generic-mm paths (that I could otherwise remove along with the new code) only for these hugepd systems. But it's still okay to me, it'll just be a matter of when to drop that code, sooner or later.

Thanks,
From: Peter Xu <peterx@redhat.com>

[based on akpm/mm-unstable latest commit 9af2e4c429b5]

v1:
- Rebase, remove RFC tag
- Fixed powerpc patch build issue, enhancing commit message [Michael]
- Optimize patch 1 & 3 on "none || !present" check [Jason]

In previous work [1], we removed the pXd_large() API, which is arch specific. This patchset further removes the hugetlb pXd_huge() API.

Hugetlb was never special in creating huge mappings when compared with other huge mappings. Having a standalone API just to detect such pgtable entries is more or less redundant, especially after the pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE.

When looking at this problem, a few issues were also exposed showing that we don't have a clear definition of the *_huge() variant APIs. This patchset starts by cleaning up these issues, then replaces all *_huge() users with *_leaf(), then drops all *_huge() code.

On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for all the rest archs they're reported "false" instead. This part is done in patches 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll leave that to hmm experts to decide.

Besides, there are three archs (arm, arm64, powerpc) that have slightly different definitions between the *_huge() vs. *_leaf() variants. I tackled them separately so that it'll be easier for arch experts to chime in when necessary. This part is done in patches 6-9.

The final patches 10-13 do the rest of the removal: since *_leaf() will be the ultimate API in the future, and we seem to have quite some confusion on how the *_huge() APIs can be defined, provide a rich comment for the *_leaf() API set to define them properly to avoid future misuse, and hopefully that'll also help new archs to start supporting huge mappings and avoid traps (like either swap entries, or PROT_NONE entry checks).

The whole series is only lightly tested on x86, while as usual I don't have the capability to test all archs that it touches.
[1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com

Peter Xu (13):
  mm/hmm: Process pud swap entry without pud_huge()
  mm/gup: Cache p4d in follow_p4d_mask()
  mm/gup: Check p4d presence before going on
  mm/x86: Change pXd_huge() behavior to exclude swap entries
  mm/sparc: Change pXd_huge() behavior to exclude swap entries
  mm/arm: Use macros to define pmd/pud helpers
  mm/arm: Redefine pmd_huge() with pmd_leaf()
  mm/arm64: Merge pXd_huge() and pXd_leaf() definitions
  mm/powerpc: Redefine pXd_huge() with pXd_leaf()
  mm/gup: Merge pXd huge mapping checks
  mm/treewide: Replace pXd_huge() with pXd_leaf()
  mm/treewide: Remove pXd_huge()
  mm: Document pXd_leaf() API

 arch/arm/include/asm/pgtable-2level.h | 4 +--
 arch/arm/include/asm/pgtable-3level-hwdef.h | 1 +
 arch/arm/include/asm/pgtable-3level.h | 6 ++--
 arch/arm/mm/Makefile | 1 -
 arch/arm/mm/hugetlbpage.c | 34 -------------------
 arch/arm64/include/asm/pgtable.h | 6 +++-
 arch/arm64/mm/hugetlbpage.c | 18 ++--------
 arch/loongarch/mm/hugetlbpage.c | 12 +------
 arch/mips/include/asm/pgtable-32.h | 2 +-
 arch/mips/include/asm/pgtable-64.h | 2 +-
 arch/mips/mm/hugetlbpage.c | 10 ------
 arch/mips/mm/tlb-r4k.c | 2 +-
 arch/parisc/mm/hugetlbpage.c | 11 ------
 .../include/asm/book3s/64/pgtable-4k.h | 20 -----------
 .../include/asm/book3s/64/pgtable-64k.h | 25 --------------
 arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +++++++--------
 arch/powerpc/include/asm/nohash/pgtable.h | 10 ------
 arch/powerpc/mm/pgtable_64.c | 6 ++--
 arch/riscv/mm/hugetlbpage.c | 10 ------
 arch/s390/mm/hugetlbpage.c | 10 ------
 arch/sh/mm/hugetlbpage.c | 10 ------
 arch/sparc/mm/hugetlbpage.c | 12 -------
 arch/x86/mm/hugetlbpage.c | 26 --------------
 arch/x86/mm/pgtable.c | 4 +--
 include/linux/hugetlb.h | 24 -------------
 include/linux/pgtable.h | 24 ++++++++++---
 mm/gup.c | 24 ++++++-------
 mm/hmm.c | 9 ++---
 mm/memory.c | 2 +-
 29 files changed, 68 insertions(+), 284 deletions(-)
 delete mode 100644 arch/arm/mm/hugetlbpage.c