[v6,07/11] mm/mremap: Use range flush that does TLB and page walk cache flush

Message ID 20210524090114.63446-8-aneesh.kumar@linux.ibm.com (mailing list archive)
State New, archived
Series Speedup mremap on ppc64

Commit Message

Aneesh Kumar K.V May 24, 2021, 9:01 a.m. UTC
Some architectures have the concept of a page walk cache, which needs
to be flushed when updating higher levels of the page tables. A fast
mremap that moves page table pages instead of copying pte entries
should flush the page walk cache, since the old translation cache is
no longer valid.

Add a new helper, flush_pte_tlb_pwc_range(), which invalidates both the
TLB and the page walk cache for ranges whose TLB entries are mapped
with page size PAGE_SIZE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 10 ++++++++++
 mm/mremap.c                                   | 14 ++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

Comments

Linus Torvalds May 24, 2021, 5:02 p.m. UTC | #1
On Sun, May 23, 2021 at 11:04 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Add a new helper, flush_pte_tlb_pwc_range(), which invalidates both the
> TLB and the page walk cache for ranges whose TLB entries are mapped
> with page size PAGE_SIZE.

So I dislike this patch for two reasons:

 (a) naming.

If the ppc people want to use crazy TLAs that have no meaning outside
of the powerpc community, that's fine. But only in powerpc code.

"pwc" makes no sense to me, or to anybody else that isn't intimately
involved in low-level powerpc stuff. I assume it's "page walk cache",
but honestly, outside of this area, PWC is mostly used for a specific
type of webcam.

So there's no way I'd accept this as-is, simply because of that.
flush_pte_tlb_pwc_range() is simply not an acceptable name. You would
have to spell it out, not use an obscure TLA.

But I think you don't even want to do that, because of

 (b) is this even worth it as a public interface?

Why doesn't the powerpc radix TLB flushing code just always flush the
page table walking cache when the range is larger than a PMD?

Once you have big flush ranges like that, I don't believe it makes any
sense not to flush the walking cache too.

NOTE! This is particularly true as "flush the walking cache" isn't a
well-defined operation anyway. Which _levels_ of the walking cache?
Again, the size (and alignment) of the flush would actually tell you.
A new boolean "flush" parameter does *NOT* tell that at all.

So I think this new interface is mis-named, but I also think it's
pointless. Just DTRT automatically when somebody asks for a flush that
covers a PMD range (or a PUD range).
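
Something along these lines inside the powerpc radix code would do it.
A rough sketch only: the function name and do_radix_flush() are made-up
stand-ins for the real radix dispatch, though RIC_FLUSH_TLB,
RIC_FLUSH_ALL and PMD_SIZE are the actual powerpc/kernel names:

        /*
         * Sketch: escalate to a page-walk-cache flush whenever the
         * flushed range covers at least one PMD worth of address
         * space, so callers never have to ask for it explicitly.
         */
        static void radix__flush_tlb_range_sketch(struct mm_struct *mm,
                                                  unsigned long start,
                                                  unsigned long end)
        {
                /* A PMD-sized (or larger) range means page tables may
                 * have been moved or freed: flush walker caches too. */
                bool flush_pwc = (end - start) >= PMD_SIZE;

                do_radix_flush(mm, start, end,
                               flush_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
        }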

              Linus
Aneesh Kumar K.V May 25, 2021, 1:27 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, May 23, 2021 at 11:04 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Add a new helper, flush_pte_tlb_pwc_range(), which invalidates both the
>> TLB and the page walk cache for ranges whose TLB entries are mapped
>> with page size PAGE_SIZE.
>
> So I dislike this patch for two reasons:
>
>  (a) naming.
>
> If the ppc people want to use crazy TLAs that have no meaning outside
> of the powerpc community, that's fine. But only in powerpc code.
>
> "pwc" makes no sense to me, or to anybody else that isn't intimately
> involved in low-level powerpc stuff. I assume it's "page walk cache",
> but honestly, outside of this area, PWC is mostly used for a specific
> type of webcam.
>
> So there's no way I'd accept this as-is, simply because of that.
> flush_pte_tlb_pwc_range() is simply not an acceptable name. You would
> have to spell it out, not use an obscure TLA.
>
> But I think you don't even want to do that, because of

How about flush_tlb_and_page_table_cache()?

>
>  (b) is this even worth it as a public interface?
>
> Why doesn't the powerpc radix TLB flushing code just always flush the
> page table walking cache when the range is larger than a PMD?
>
> Once you have big flush ranges like that, I don't believe it makes any
> sense not to flush the walking cache too.

But such a large range invalidate doesn't imply we are freeing page
tables, so forcing a page table cache flush on every large range
invalidate can have a performance impact. Moreover, ppc64 doesn't do a
ranged page table cache invalidate, so we would have to flush the full
page table cache.

>
> NOTE! This is particularly true as "flush the walking cache" isn't a
> well-defined operation anyway. Which _levels_ of the walking cache?
> Again, the size (and alignment) of the flush would actually tell you.
> A new boolean "flush" parameter does *NOT* tell that at all.
>
> So I think this new interface is mis-named, but I also think it's
> pointless. Just DTRT automatically when somebody asks for a flush that
> covers a PMD range (or a PUD range).
>
>               Linus

-aneesh
Linus Torvalds May 25, 2021, 5:08 p.m. UTC | #3
On Tue, May 25, 2021 at 3:28 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> How about flush_tlb_and_page_table_cache()?

Honestly, I'd prefer it to be a separate function.

So keep the existing

        flush_tlb()

as-is, and add a

        flush_tlb_walking_cache()

and document that any architecture that flushes the walker caches as
part of the regular tlb flush can just make that a no-op.
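
Roughly like the following sketch. The signature here is an assumption
(mirroring flush_tlb_range()); architectures that need a separate
walker-cache flush would override it by defining the macro in their own
headers:

        /*
         * Generic no-op default: architectures whose flush_tlb_range()
         * already invalidates the page table walker caches need not
         * provide their own version.
         */
        #ifndef flush_tlb_walking_cache
        static inline void flush_tlb_walking_cache(struct vm_area_struct *vma,
                                                   unsigned long start,
                                                   unsigned long end)
        {
        }
        #endif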

Would that work for powerpc?

But:

> >  (b) is this even worth it as a public interface?
>
> But such a large range invalidate doesn't imply we are freeing page
> tables.

No. But it's what everybody else (ie x86 and ARM) does, and if you're
flushing megabytes of TLBs, what's the downside of flushing a few TLB
walker cache entries?

You already do that for internal powerpc errata anyway (ie
"mm_needs_flush_escalation()"), so I'm saying that you might as well
treat the page walker cache as a powerpc-internal implementation
thing.

Put another way: can you even _measure_ the difference between "just
make powerpc look like everybody else" and "add a new explicit page
table walker cache flush function interface"?

Now, from a quick look at the powerpc code, it looks like powerpc is a
bit mis-architected, and when you flush the walker cache, you flush
everything for that ASID. x86 and arm only flush the parts affected by
the TLB flush range (now, admittedly, that's what they do
_architecturally_ - for all I know the actual hardware just always
flushes all walker caches when you flush any TLB and so maybe they act
exactly like the powerpc RIC_FLUSH_PWC in practice).

So maybe it's measurable. But I kind of doubt it, and I'd like to know
that you've actually done some testing to see that "yes, this matters,
I can't just say 'if flushing more than a pmd, just flush the walker
cache too'".

Hmm?

                Linus

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index f9f8a3a264f7..e84fee9db106 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -80,6 +80,16 @@  static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
 	return flush_hugetlb_tlb_pwc_range(vma, start, end, false);
 }
 
+#define flush_pte_tlb_pwc_range flush_pte_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+					   unsigned long start, unsigned long end)
+{
+	if (radix_enabled())
+		return radix__flush_tlb_pwc_range_psize(vma->vm_mm, start,
+							end, mmu_virtual_psize, true);
+	return hash__flush_tlb_range(vma, start, end);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
index 7372c8c0cf26..000a71917557 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -210,6 +210,16 @@  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		drop_rmap_locks(vma);
 }
 
+#ifndef flush_pte_tlb_pwc_range
+#define flush_pte_tlb_pwc_range flush_pte_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+					   unsigned long start,
+					   unsigned long end)
+{
+	return flush_tlb_range(vma, start, end);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -260,7 +270,7 @@  static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	VM_BUG_ON(!pmd_none(*new_pmd));
 	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 
-	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+	flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
 	spin_unlock(old_ptl);
@@ -307,7 +317,7 @@  static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 	VM_BUG_ON(!pud_none(*new_pud));
 
 	pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
-	flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
+	flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
 	spin_unlock(old_ptl);