diff mbox series

[v2] riscv: mm: Implement pmdp_collapse_flush for THP

Message ID 20230130074815.1694055-1-mchitale@ventanamicro.com (mailing list archive)
State Accepted
Commit f0293cd1f4fcc4fbdcd65a5a7b3b318a6d471f78
Delegated to: Palmer Dabbelt
Headers show
Series [v2] riscv: mm: Implement pmdp_collapse_flush for THP | expand

Checks

Context Check Description
conchuod/cover_letter success Single patches do not need cover letters
conchuod/tree_selection success Guessed tree name to be fixes
conchuod/fixes_present success Fixes tag present in non-next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 13 and now 13
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 2014 this patch: 2014
conchuod/alphanumeric_selects success Out of order selects before the patch: 57 and now 57
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 2 this patch: 2
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: extern prototypes should be avoided in .h files WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
conchuod/source_inline success Was 0 now: 0
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success Fixes tag looks correct
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Mayuresh Chitale Jan. 30, 2023, 7:48 a.m. UTC
When THP is enabled, 4K pages are collapsed into a single huge
page using the generic pmdp_collapse_flush() which will further
use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
using address specific SFENCEs which results in repetitive (or
unpredictable) page faults on RISC-V implementations which cache
non-leaf PTEs.

Provide a RISC-V specific pmdp_collapse_flush() which ensures both
cached leaf and non-leaf PTEs are invalidated by using non-address
specific SFENCEs as recommended by the RISC-V privileged specification.

Fixes: e88b333142e4 ("riscv: mm: add THP support on 64-bit")
Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
---
 arch/riscv/include/asm/pgtable.h |  4 ++++
 arch/riscv/mm/pgtable.c          | 26 ++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

Comments

Andrew Jones Jan. 30, 2023, 8:03 a.m. UTC | #1
On Mon, Jan 30, 2023 at 01:18:15PM +0530, Mayuresh Chitale wrote:
> When THP is enabled, 4K pages are collapsed into a single huge
> page using the generic pmdp_collapse_flush() which will further
> use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
> the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
> using address specific SFENCEs which results in repetitive (or
> unpredictable) page faults on RISC-V implementations which cache
> non-leaf PTEs.
> 
> Provide a RISC-V specific pmdp_collapse_flush() which ensures both
> cached leaf and non-leaf PTEs are invalidated by using non-address
> specific SFENCEs as recommended by the RISC-V privileged specification.
> 
> Fixes: e88b333142e4 ("riscv: mm: add THP support on 64-bit")
> Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
> ---

Please add a changelog here under the --- to explain the differences
between this version and the last version. b4-diff shows me what
changed, but the changelog should be present to explain why it
changed.

>  arch/riscv/include/asm/pgtable.h |  4 ++++
>  arch/riscv/mm/pgtable.c          | 26 ++++++++++++++++++++++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 4eba9a98d0e3..3e01f4f3ab08 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -721,6 +721,10 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  	page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
>  	return __pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd)));
>  }
> +
> +#define pmdp_collapse_flush pmdp_collapse_flush
> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  /*
> diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
> index 6645ead1a7c1..5da1916c231e 100644
> --- a/arch/riscv/mm/pgtable.c
> +++ b/arch/riscv/mm/pgtable.c
> @@ -81,3 +81,29 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>  }
>  
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmdp)
> +{
> +	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON(pmd_trans_huge(*pmdp));

These checks should come before the use of variables being checked, as is
done in the common version of the function.

> +	/*
> +	 * When leaf PTE enteries (regular pages) are collapsed into a leaf
> +	 * PMD entry (huge page), a valid non-leaf PTE is converted into a
> +	 * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
> +	 * specification allows implementations to cache valid non-leaf PTEs,
> +	 * but the section "4.2.1 Supervisor Memory-Management Fence
> +	 * Instruction" recommends the following:
> +	 * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
> +	 * with rs1=x0. If any PTE along the traversal path had its G bit set,
> +	 * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
> +	 * the translation is being modified."
> +	 * Based on the above recommendation, we should do full flush whenever
> +	 * leaf PTE entries are collapsed into a leaf PMD entry.
> +	 */
> +	flush_tlb_mm(vma->vm_mm);
> +	return pmd;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -- 
> 2.34.1
>

Thanks,
drew
Palmer Dabbelt Feb. 2, 2023, 4:35 a.m. UTC | #2
On Sun, 29 Jan 2023 23:48:15 PST (-0800), mchitale@ventanamicro.com wrote:
> When THP is enabled, 4K pages are collapsed into a single huge
> page using the generic pmdp_collapse_flush() which will further
> use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
> the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
> using address specific SFENCEs which results in repetitive (or
> unpredictable) page faults on RISC-V implementations which cache
> non-leaf PTEs.
>
> Provide a RISC-V specific pmdp_collapse_flush() which ensures both
> cached leaf and non-leaf PTEs are invalidated by using non-address
> specific SFENCEs as recommended by the RISC-V privileged specification.
>
> Fixes: e88b333142e4 ("riscv: mm: add THP support on 64-bit")
> Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
> ---
>  arch/riscv/include/asm/pgtable.h |  4 ++++
>  arch/riscv/mm/pgtable.c          | 26 ++++++++++++++++++++++++++
>  2 files changed, 30 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 4eba9a98d0e3..3e01f4f3ab08 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -721,6 +721,10 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  	page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
>  	return __pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd)));
>  }
> +
> +#define pmdp_collapse_flush pmdp_collapse_flush
> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  /*
> diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
> index 6645ead1a7c1..5da1916c231e 100644
> --- a/arch/riscv/mm/pgtable.c
> +++ b/arch/riscv/mm/pgtable.c
> @@ -81,3 +81,29 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>  }
>
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmdp)
> +{
> +	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON(pmd_trans_huge(*pmdp));
> +	/*
> +	 * When leaf PTE enteries (regular pages) are collapsed into a leaf
> +	 * PMD entry (huge page), a valid non-leaf PTE is converted into a
> +	 * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
> +	 * specification allows implementations to cache valid non-leaf PTEs,
> +	 * but the section "4.2.1 Supervisor Memory-Management Fence
> +	 * Instruction" recommends the following:
> +	 * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
> +	 * with rs1=x0. If any PTE along the traversal path had its G bit set,
> +	 * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
> +	 * the translation is being modified."
> +	 * Based on the above recommendation, we should do full flush whenever
> +	 * leaf PTE entries are collapsed into a leaf PMD entry.

It's generally best to ignore the recommendations in the commentary 
about flushes, they assume a specific software model that doesn't always 
apply to Linux.  In this case we do need the fence, though, but I 
changed the comment to explain it differently (an fix at least one 
spelling mistake).

I've squashed this in and have it in staging, if that looks good I'll 
put it on fixes.  Thanks!

diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 5da1916c231e..fef4e7328e49 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -90,18 +90,12 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	VM_BUG_ON(pmd_trans_huge(*pmdp));
 	/*
-	 * When leaf PTE enteries (regular pages) are collapsed into a leaf
+	 * When leaf PTE entries (regular pages) are collapsed into a leaf
 	 * PMD entry (huge page), a valid non-leaf PTE is converted into a
-	 * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
-	 * specification allows implementations to cache valid non-leaf PTEs,
-	 * but the section "4.2.1 Supervisor Memory-Management Fence
-	 * Instruction" recommends the following:
-	 * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
-	 * with rs1=x0. If any PTE along the traversal path had its G bit set,
-	 * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
-	 * the translation is being modified."
-	 * Based on the above recommendation, we should do full flush whenever
-	 * leaf PTE entries are collapsed into a leaf PMD entry.
+	 * valid leaf PTE at the level 1 page table.  Since the sfence.vma
+	 * forms that specify an address only apply to leaf PTEs, we need a
+	 * global flush here.  collapse_huge_page() assumes these flushes are
+	 * eager, so just do the fence here.
 	 */
 	flush_tlb_mm(vma->vm_mm);
 	return pmd;


> +	 */
> +	flush_tlb_mm(vma->vm_mm);
> +	return pmd;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Anup Patel Feb. 3, 2023, 5:23 p.m. UTC | #3
On Thu, Feb 2, 2023 at 10:05 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Sun, 29 Jan 2023 23:48:15 PST (-0800), mchitale@ventanamicro.com wrote:
> > When THP is enabled, 4K pages are collapsed into a single huge
> > page using the generic pmdp_collapse_flush() which will further
> > use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
> > the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
> > using address specific SFENCEs which results in repetitive (or
> > unpredictable) page faults on RISC-V implementations which cache
> > non-leaf PTEs.
> >
> > Provide a RISC-V specific pmdp_collapse_flush() which ensures both
> > cached leaf and non-leaf PTEs are invalidated by using non-address
> > specific SFENCEs as recommended by the RISC-V privileged specification.
> >
> > Fixes: e88b333142e4 ("riscv: mm: add THP support on 64-bit")
> > Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
> > ---
> >  arch/riscv/include/asm/pgtable.h |  4 ++++
> >  arch/riscv/mm/pgtable.c          | 26 ++++++++++++++++++++++++++
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index 4eba9a98d0e3..3e01f4f3ab08 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -721,6 +721,10 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >       page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
> >       return __pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd)));
> >  }
> > +
> > +#define pmdp_collapse_flush pmdp_collapse_flush
> > +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> > +                              unsigned long address, pmd_t *pmdp);
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> >  /*
> > diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
> > index 6645ead1a7c1..5da1916c231e 100644
> > --- a/arch/riscv/mm/pgtable.c
> > +++ b/arch/riscv/mm/pgtable.c
> > @@ -81,3 +81,29 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
> >  }
> >
> >  #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> > +                                     unsigned long address, pmd_t *pmdp)
> > +{
> > +     pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> > +
> > +     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > +     VM_BUG_ON(pmd_trans_huge(*pmdp));
> > +     /*
> > +      * When leaf PTE enteries (regular pages) are collapsed into a leaf
> > +      * PMD entry (huge page), a valid non-leaf PTE is converted into a
> > +      * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
> > +      * specification allows implementations to cache valid non-leaf PTEs,
> > +      * but the section "4.2.1 Supervisor Memory-Management Fence
> > +      * Instruction" recommends the following:
> > +      * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
> > +      * with rs1=x0. If any PTE along the traversal path had its G bit set,
> > +      * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
> > +      * the translation is being modified."
> > +      * Based on the above recommendation, we should do full flush whenever
> > +      * leaf PTE entries are collapsed into a leaf PMD entry.
>
> It's generally best to ignore the recommendations in the commentary
> about flushes, they assume a specific software model that doesn't always
> apply to Linux.  In this case we do need the fence, though, but I
> changed the comment to explain it differently (an fix at least one
> spelling mistake).
>
> I've squashed this in and have it in staging, if that looks good I'll
> put it on fixes.  Thanks!
>
> diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
> index 5da1916c231e..fef4e7328e49 100644
> --- a/arch/riscv/mm/pgtable.c
> +++ b/arch/riscv/mm/pgtable.c
> @@ -90,18 +90,12 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>         VM_BUG_ON(pmd_trans_huge(*pmdp));
>         /*
> -        * When leaf PTE enteries (regular pages) are collapsed into a leaf
> +        * When leaf PTE entries (regular pages) are collapsed into a leaf
>          * PMD entry (huge page), a valid non-leaf PTE is converted into a
> -        * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
> -        * specification allows implementations to cache valid non-leaf PTEs,
> -        * but the section "4.2.1 Supervisor Memory-Management Fence
> -        * Instruction" recommends the following:
> -        * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
> -        * with rs1=x0. If any PTE along the traversal path had its G bit set,
> -        * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
> -        * the translation is being modified."
> -        * Based on the above recommendation, we should do full flush whenever
> -        * leaf PTE entries are collapsed into a leaf PMD entry.
> +        * valid leaf PTE at the level 1 page table.  Since the sfence.vma
> +        * forms that specify an address only apply to leaf PTEs, we need a
> +        * global flush here.  collapse_huge_page() assumes these flushes are
> +        * eager, so just do the fence here.
>          */
>         flush_tlb_mm(vma->vm_mm);
>         return pmd;

This looks good to me. Please add it to the list of fixes.

Regards,
Anup

>
>
> > +      */
> > +     flush_tlb_mm(vma->vm_mm);
> > +     return pmd;
> > +}
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
patchwork-bot+linux-riscv@kernel.org Feb. 9, 2023, 7:40 p.m. UTC | #4
Hello:

This patch was applied to riscv/linux.git (fixes)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Mon, 30 Jan 2023 13:18:15 +0530 you wrote:
> When THP is enabled, 4K pages are collapsed into a single huge
> page using the generic pmdp_collapse_flush() which will further
> use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
> the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
> using address specific SFENCEs which results in repetitive (or
> unpredictable) page faults on RISC-V implementations which cache
> non-leaf PTEs.
> 
> [...]

Here is the summary with links:
  - [v2] riscv: mm: Implement pmdp_collapse_flush for THP
    https://git.kernel.org/riscv/c/f0293cd1f4fc

You are awesome, thank you!
diff mbox series

Patch

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 4eba9a98d0e3..3e01f4f3ab08 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -721,6 +721,10 @@  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
 	return __pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd)));
 }
+
+#define pmdp_collapse_flush pmdp_collapse_flush
+extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 6645ead1a7c1..5da1916c231e 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -81,3 +81,29 @@  int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 }
 
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmdp)
+{
+	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(pmd_trans_huge(*pmdp));
+	/*
+	 * When leaf PTE enteries (regular pages) are collapsed into a leaf
+	 * PMD entry (huge page), a valid non-leaf PTE is converted into a
+	 * valid leaf PTE at the level 1 page table. The RISC-V privileged v1.12
+	 * specification allows implementations to cache valid non-leaf PTEs,
+	 * but the section "4.2.1 Supervisor Memory-Management Fence
+	 * Instruction" recommends the following:
+	 * "If software modifies a non-leaf PTE, it should execute SFENCE.VMA
+	 * with rs1=x0. If any PTE along the traversal path had its G bit set,
+	 * rs2 must be x0; otherwise, rs2 should be set to the ASID for which
+	 * the translation is being modified."
+	 * Based on the above recommendation, we should do full flush whenever
+	 * leaf PTE entries are collapsed into a leaf PMD entry.
+	 */
+	flush_tlb_mm(vma->vm_mm);
+	return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */