[3/3] arm64, mm: Use IPIs for TLB invalidation.

Message ID 1436646323-10527-4-git-send-email-ddaney.cavm@gmail.com (mailing list archive)
State New, archived

Commit Message

David Daney July 11, 2015, 8:25 p.m. UTC
From: David Daney <david.daney@cavium.com>

Most broadcast TLB invalidations are unnecessary.  So when
invalidating for a given mm/vma, target only the CPUs that need it
via an IPI.

For global TLB invalidations, also use IPI.

Tested on Cavium ThunderX.

This change reduces 'time make -j48' on a kernel build from 139s to
116s (83% of the original time).

The patch is needed because of a ThunderX Pass1 erratum: Exclusive
store operations are unreliable in the presence of broadcast TLB
invalidations.  The performance improvement shown makes the change
compelling even without the need for the erratum workaround.

Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/arm64/include/asm/tlbflush.h | 67 ++++++---------------------------------
 arch/arm64/mm/flush.c             | 46 +++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 57 deletions(-)

Comments

Sergei Shtylyov July 11, 2015, 10:06 p.m. UTC | #1
Hello.

On 07/11/2015 11:25 PM, David Daney wrote:

> From: David Daney <david.daney@cavium.com>

> Most broadcast TLB invalidations are unnecessary.  So when
> invalidating for a given mm/vma target the only the needed CPUs via

    The only the needed?

> and IPI.

> For global TLB invalidations, also use IPI.

> Tested on Cavium ThunderX.

> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations.  The performance improvements shown make it compelling
> even without the erratum workaround need.

> Signed-off-by: David Daney <david.daney@cavium.com>

WBR, Sergei
Catalin Marinas July 12, 2015, 9:58 p.m. UTC | #2
On Sat, Jul 11, 2015 at 01:25:23PM -0700, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
> 
> Most broadcast TLB invalidations are unnecessary.  So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
> 
> For global TLB invalidations, also use IPI.
> 
> Tested on Cavium ThunderX.
> 
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Have you tried something like kernbench? It tends to be more consistent
than a simple "time make".

However, the main question is how the performance looks on systems with
a lot fewer CPUs (like 4 to 8). The results are highly dependent on the
type of application, CPU and SoC implementation (I've done similar
benchmarks in the past), so I don't think there is a simple answer here.

> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations. The performance improvements shown make it compelling
> even without the erratum workaround need.

This performance argument is debatable; I need more data, and not just
for the Cavium boards and kernel building. In the meantime, it's an
erratum workaround and it needs to follow the other workarounds we have
in the kernel, with a proper Kconfig option and alternatives that can be
patched in or out at run-time (I wonder whether jump labels would be
better suited here).
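
A rough sketch of that shape (the Kconfig symbol and static key name are
invented for illustration; nothing below is part of this patch):

#include <linux/jump_label.h>
#include <linux/smp.h>

/*
 * Hypothetical: selected by a CONFIG_CAVIUM_TLBI_WORKAROUND option and
 * enabled at boot via static_key_slow_inc() when an affected ThunderX
 * pass is detected.
 */
static struct static_key tlbi_use_ipi = STATIC_KEY_INIT_FALSE;

void flush_tlb_mm(struct mm_struct *mm)
{
	unsigned long asid = (unsigned long)ASID(mm) << 48;

	/* Make the page table updates visible before invalidating. */
	dsb(ishst);

	if (static_key_false(&tlbi_use_ipi)) {
		/* Erratum path: IPI only the CPUs that ran this mm. */
		on_each_cpu_mask(mm_cpumask(mm),
				 flush_tlb_mm_local, (void *)asid, 1);
	} else {
		/* Default path: broadcast invalidation by ASID. */
		asm("tlbi	aside1is, %0" : : "r" (asid));
		dsb(ish);
	}
}

Either way, the broadcast code stays the default and the IPI path only
gets enabled on the affected parts.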
Will Deacon July 13, 2015, 6:17 p.m. UTC | #3
On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
> 
> Most broadcast TLB invalidations are unnecessary.  So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
> 
> For global TLB invalidations, also use IPI.
> 
> Tested on Cavium ThunderX.
> 
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Any idea *why* you're seeing such an improvement? Some older kernels had
a bug where we'd try to flush a negative (i.e. huge) range by page, so it
would be nice to rule that out. I assume these measurements are using
mainline?

Having TLBI responsible for that much of a kernel build doesn't feel
right to me and doesn't line up with the profiles I'm used to seeing.

You have 16-bit ASIDs, right?

Will
David Daney July 13, 2015, 6:58 p.m. UTC | #4
On 07/13/2015 11:17 AM, Will Deacon wrote:
> On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
>> From: David Daney <david.daney@cavium.com>
>>
>> Most broadcast TLB invalidations are unnecessary.  So when
>> invalidating for a given mm/vma target the only the needed CPUs via
>> and IPI.
>>
>> For global TLB invalidations, also use IPI.
>>
>> Tested on Cavium ThunderX.
>>
>> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
>> as long).
>
> Any idea *why* you're seeing such an improvement? Some older kernels had
> a bug where we'd try to flush a negative (i.e. huge) range by page, so it
> would be nice to rule that out. I assume these measurements are using
> mainline?

I have an untested multi-part theory:

1) Most of the invalidations in the kernel build will be for a mm that 
was only used on a single CPU (the current CPU), so IPIs are for the 
most part not needed.  We win by not having to synchronize across all 
CPUs waiting for the DSB to complete.  I think most of it occurs at 
process exit.  Q: why do anything at process exit?  The use of ASIDs 
should make TLB invalidations at process death unnecessary.

2) By simplifying the VA range invalidations to just a single ASID-based
invalidation, we are issuing many fewer TLBI broadcasts.  The overhead
of refilling the local TLB with still-needed mappings may be lower than
the overhead of all those TLBI operations.
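
A rough back-of-the-envelope illustration (assuming 4K pages): unmapping
a single 2MB region with the old __flush_tlb_range() issues
2MB / 4KB = 512 "tlbi vae1is" broadcasts, each of which has to complete
on every CPU before the dsb(ish) returns.  With this patch the same
unmap becomes one IPI plus one local "tlbi aside1" on each CPU that
actually ran the mm, typically just the current CPU.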

>
> Having TLBI responsible for that amount of a kernel build doesn't feel
> right to me and doesn't line-up with the profiles I'm used to seeing.

I don't have enough information to comment on this at the moment.

>
> You have 16-bit ASIDs, right?

Correct.  This means we aren't doing the rollover work very often, and 
that it is therefore not a significant source of system overhead.
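
Rough numbers: 16 bits give 2^16 = 65536 ASIDs per generation, so the
rollover path (and the flush that comes with it) only has to run on the
order of once per 64K new address spaces.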


>
> Will
>

Patch

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 42c09ec..2c132b0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -63,46 +63,22 @@ 
  *		only require the D-TLB to be invalidated.
  *		- kaddr - Kernel virtual memory address
  */
-static inline void flush_tlb_all(void)
-{
-	dsb(ishst);
-	asm("tlbi	vmalle1is");
-	dsb(ish);
-	isb();
-}
-
-static inline void flush_tlb_mm(struct mm_struct *mm)
-{
-	unsigned long asid = (unsigned long)ASID(mm) << 48;
+void flush_tlb_all(void);
 
-	dsb(ishst);
-	asm("tlbi	aside1is, %0" : : "r" (asid));
-	dsb(ish);
-}
+void flush_tlb_mm(struct mm_struct *mm);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
-	unsigned long addr = uaddr >> 12 |
-		((unsigned long)ASID(vma->vm_mm) << 48);
-
-	dsb(ishst);
-	asm("tlbi	vae1is, %0" : : "r" (addr));
-	dsb(ish);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end)
 {
-	unsigned long asid = (unsigned long)ASID(vma->vm_mm) << 48;
-	unsigned long addr;
-	start = asid | (start >> 12);
-	end = asid | (end >> 12);
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
-		asm("tlbi vae1is, %0" : : "r"(addr));
-	dsb(ish);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void flush_tlb_all_local(void)
@@ -112,40 +88,17 @@  static inline void flush_tlb_all_local(void)
 	isb();
 }
 
-static inline void __flush_tlb_kernel_range(unsigned long start, unsigned long end)
-{
-	unsigned long addr;
-	start >>= 12;
-	end >>= 12;
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
-		asm("tlbi vaae1is, %0" : : "r"(addr));
-	dsb(ish);
-	isb();
-}
-
-/*
- * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
- * necessarily a performance improvement.
- */
-#define MAX_TLB_RANGE	(1024UL << PAGE_SHIFT)
-
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
-	if ((end - start) <= MAX_TLB_RANGE)
-		__flush_tlb_range(vma, start, end);
-	else
-		flush_tlb_mm(vma->vm_mm);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	if ((end - start) <= MAX_TLB_RANGE)
-		__flush_tlb_kernel_range(start, end);
-	else
-		flush_tlb_all();
+	/* Simplify to all. */
+	flush_tlb_all();
 }
 
 /*
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 4dfa397..45f24d3 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@ 
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/smp.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cachetype.h>
@@ -27,6 +28,51 @@ 
 
 #include "mm.h"
 
+static void flush_tlb_local(void *info)
+{
+	asm volatile("\n"
+		     "	tlbi	vmalle1\n"
+		     "	isb	sy"
+		);
+}
+
+static void flush_tlb_mm_local(void *info)
+{
+	unsigned long asid = (unsigned long)info;
+
+	asm volatile("\n"
+		     "	tlbi	aside1, %0\n"
+		     "	isb	sy"
+		     : : "r" (asid)
+		);
+}
+
+void flush_tlb_all(void)
+{
+	/* Make sure page table modifications are visible. */
+	dsb(ishst);
+	/* IPI to all CPUs to do local flush. */
+	on_each_cpu(flush_tlb_local, NULL, 1);
+
+}
+EXPORT_SYMBOL(flush_tlb_all);
+
+void flush_tlb_mm(struct mm_struct *mm)
+{
+	if (!mm) {
+		flush_tlb_all();
+	} else {
+		unsigned long asid = (unsigned long)ASID(mm) << 48;
+		/* Make sure page table modifications are visible. */
+		dsb(ishst);
+		/* IPI to all CPUs to do local flush. */
+		on_each_cpu_mask(mm_cpumask(mm),
+				 flush_tlb_mm_local, (void *)asid, 1);
+	}
+
+}
+EXPORT_SYMBOL(flush_tlb_mm);
+
 void flush_cache_range(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end)
 {