Message ID | 1436646323-10527-4-git-send-email-ddaney.cavm@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Hello.

On 07/11/2015 11:25 PM, David Daney wrote:

> From: David Daney <david.daney@cavium.com>
>
> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via

   The only the needed?

> and IPI.
>
> For global TLB invalidations, also use IPI.
>
> Tested on Cavium ThunderX.
>
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).
>
> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations. The performance improvements shown make it compelling
> even without the erratum workaround need.
>
> Signed-off-by: David Daney <david.daney@cavium.com>

WBR, Sergei
On Sat, Jul 11, 2015 at 01:25:23PM -0700, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
>
> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
>
> For global TLB invalidations, also use IPI.
>
> Tested on Cavium ThunderX.
>
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Have you tried something like kernbench? It tends to be more consistent
than a simple "time make".

However, the main question is how the performance looks on systems with a
lot fewer CPUs (like 4 to 8). The results are highly dependent on the type
of application, CPU and SoC implementation (I've done similar benchmarks
in the past), so I don't think there is a simple answer here.

> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations. The performance improvements shown make it compelling
> even without the erratum workaround need.

This performance argument is debatable; I need more data, and not just for
the Cavium boards and kernel building. In the meantime, this is an erratum
workaround, and it needs to follow the other workarounds we have in the
kernel, with a proper Kconfig option and alternatives that can be patched
in or out at run-time (I wonder whether jump labels would be better suited
here).
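For illustration only, a minimal sketch of the kind of run-time selection Catalin alludes to, using the kernel's jump-label (static key) API. The key name, the flush_tlb_mm_ipi() helper, and the wiring into errata detection are assumptions for the sketch, not part of the posted patch:

	#include <linux/jump_label.h>
	#include <linux/mm_types.h>
	#include <asm/mmu_context.h>
	#include <asm/barrier.h>

	/* Hypothetical key, flipped with static_key_slow_inc() from the
	 * CPU-errata detection code when an affected ThunderX part is found. */
	struct static_key tlbi_ipi_workaround = STATIC_KEY_INIT_FALSE;

	void flush_tlb_mm_ipi(struct mm_struct *mm);	/* hypothetical helper */

	void flush_tlb_mm(struct mm_struct *mm)
	{
		if (static_key_false(&tlbi_ipi_workaround)) {
			/* Erratum path: IPI the CPUs in mm_cpumask(mm) and
			 * run a local "tlbi aside1" on each of them. */
			flush_tlb_mm_ipi(mm);
		} else {
			/* Default path: broadcast invalidate by ASID. */
			unsigned long asid = (unsigned long)ASID(mm) << 48;

			dsb(ishst);
			asm("tlbi	aside1is, %0" : : "r" (asid));
			dsb(ish);
		}
	}

Jump labels keep the common (unaffected) path a single patched-out branch, which is why they are attractive for an erratum that most systems never hit.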
On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
> From: David Daney <david.daney@cavium.com>
>
> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
>
> For global TLB invalidations, also use IPI.
>
> Tested on Cavium ThunderX.
>
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Any idea *why* you're seeing such an improvement? Some older kernels had
a bug where we'd try to flush a negative (i.e. huge) range by page, so it
would be nice to rule that out. I assume these measurements are using
mainline?

Having TLBI responsible for that amount of a kernel build doesn't feel
right to me and doesn't line up with the profiles I'm used to seeing.

You have 16-bit ASIDs, right?

Will
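As background on the "negative (i.e. huge) range" failure mode Will mentions: with unsigned arithmetic, a range whose end sits below its start wraps around, so a by-page flush driven by that count would issue an astronomical number of broadcast TLBIs. A standalone illustration of the arithmetic (the addresses are made up):

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical mis-ordered range handed to a per-page flush. */
		unsigned long start = 0x0000400000003000UL;
		unsigned long end   = start - 0x2000UL;	/* "negative" range */

		/* Unsigned subtraction wraps, so what was meant to be a tiny
		 * (or empty) range looks like roughly 2^52 pages; any by-page
		 * flush driven by this count broadcasts one TLBI per "page". */
		unsigned long pages = (end - start) >> 12;

		printf("pages to flush: %lu\n", pages);
		return 0;
	}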
On 07/13/2015 11:17 AM, Will Deacon wrote:
> On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
>> From: David Daney <david.daney@cavium.com>
>>
>> Most broadcast TLB invalidations are unnecessary. So when
>> invalidating for a given mm/vma target the only the needed CPUs via
>> and IPI.
>>
>> For global TLB invalidations, also use IPI.
>>
>> Tested on Cavium ThunderX.
>>
>> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
>> as long).
>
> Any idea *why* you're seeing such an improvement? Some older kernels had
> a bug where we'd try to flush a negative (i.e. huge) range by page, so it
> would be nice to rule that out. I assume these measurements are using
> mainline?

I have an untested multi-part theory:

1) Most of the invalidations in the kernel build will be for a mm that was
only used on a single CPU (the current CPU), so IPIs are for the most part
not needed. We win by not having to synchronize across all CPUs waiting
for the DSB to complete. I think most of this occurs at process exit.
Q: why do anything at process exit? The use of ASIDs should make TLB
invalidations at process death unnecessary.

2) By simplifying the VA-range invalidations to a single ASID-based
invalidation, we issue many fewer TLBI broadcasts. The overhead of
refilling the local TLB with still-needed mappings may be lower than the
overhead of all those TLBI operations.

> Having TLBI responsible for that amount of a kernel build doesn't feel
> right to me and doesn't line up with the profiles I'm used to seeing.

I don't have enough information to comment on this at the moment.

> You have 16-bit ASIDs, right?

Correct. This means we aren't doing the rollover work very often, so it is
not a significant source of system overhead.
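To make point (1) concrete: the IPI path in the patch below targets only the CPUs recorded in mm_cpumask(mm), and on_each_cpu_mask() runs the callback directly on the calling CPU when its own bit is set, so an exiting single-threaded task typically triggers no cross-CPU interrupts at all. A sketch of just that targeting step, restating the mechanism from the patch (the function name is illustrative and the callback is elided):

	#include <linux/smp.h>
	#include <linux/mm_types.h>
	#include <asm/mmu_context.h>
	#include <asm/barrier.h>

	static void flush_tlb_mm_local(void *info);	/* local "tlbi aside1", as in the patch */

	static void flush_tlb_mm_by_ipi(struct mm_struct *mm)
	{
		unsigned long asid = (unsigned long)ASID(mm) << 48;

		/* Make the page-table updates visible before any CPU re-walks. */
		dsb(ishst);

		/*
		 * Only CPUs that have ever run this mm have their bit set in
		 * mm_cpumask(mm). on_each_cpu_mask() executes the callback
		 * locally for the calling CPU and sends IPIs only to the
		 * others, so a single-CPU mm is flushed with neither a
		 * broadcast TLBI nor any IPI traffic.
		 */
		on_each_cpu_mask(mm_cpumask(mm), flush_tlb_mm_local,
				 (void *)asid, 1);
	}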
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 42c09ec..2c132b0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -63,46 +63,22 @@
  *	only require the D-TLB to be invalidated.
  *	- kaddr - Kernel virtual memory address
  */
-static inline void flush_tlb_all(void)
-{
-	dsb(ishst);
-	asm("tlbi	vmalle1is");
-	dsb(ish);
-	isb();
-}
-
-static inline void flush_tlb_mm(struct mm_struct *mm)
-{
-	unsigned long asid = (unsigned long)ASID(mm) << 48;
+void flush_tlb_all(void);
 
-	dsb(ishst);
-	asm("tlbi	aside1is, %0" : : "r" (asid));
-	dsb(ish);
-}
+void flush_tlb_mm(struct mm_struct *mm);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
-	unsigned long addr = uaddr >> 12 |
-		((unsigned long)ASID(vma->vm_mm) << 48);
-
-	dsb(ishst);
-	asm("tlbi	vae1is, %0" : : "r" (addr));
-	dsb(ish);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end)
 {
-	unsigned long asid = (unsigned long)ASID(vma->vm_mm) << 48;
-	unsigned long addr;
-	start = asid | (start >> 12);
-	end = asid | (end >> 12);
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
-		asm("tlbi vae1is, %0" : : "r"(addr));
-	dsb(ish);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void flush_tlb_all_local(void)
@@ -112,40 +88,17 @@ static inline void flush_tlb_all_local(void)
 	isb();
 }
 
-static inline void __flush_tlb_kernel_range(unsigned long start, unsigned long end)
-{
-	unsigned long addr;
-	start >>= 12;
-	end >>= 12;
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
-		asm("tlbi vaae1is, %0" : : "r"(addr));
-	dsb(ish);
-	isb();
-}
-
-/*
- * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
- * necessarily a performance improvement.
- */
-#define MAX_TLB_RANGE	(1024UL << PAGE_SHIFT)
-
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
-	if ((end - start) <= MAX_TLB_RANGE)
-		__flush_tlb_range(vma, start, end);
-	else
-		flush_tlb_mm(vma->vm_mm);
+	/* Simplify to entire mm. */
+	flush_tlb_mm(vma->vm_mm);
 }
 
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	if ((end - start) <= MAX_TLB_RANGE)
-		__flush_tlb_kernel_range(start, end);
-	else
-		flush_tlb_all();
+	/* Simplify to all. */
+	flush_tlb_all();
 }
 
 /*
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 4dfa397..45f24d3 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/smp.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cachetype.h>
@@ -27,6 +28,51 @@
 
 #include "mm.h"
 
+static void flush_tlb_local(void *info)
+{
+	asm volatile("\n"
+		     "	tlbi	vmalle1\n"
+		     "	isb	sy"
+		     );
+}
+
+static void flush_tlb_mm_local(void *info)
+{
+	unsigned long asid = (unsigned long)info;
+
+	asm volatile("\n"
+		     "	tlbi	aside1, %0\n"
+		     "	isb	sy"
+		     : : "r" (asid)
+		     );
+}
+
+void flush_tlb_all(void)
+{
+	/* Make sure page table modifications are visible. */
+	dsb(ishst);
+	/* IPI to all CPUs to do local flush. */
+	on_each_cpu(flush_tlb_local, NULL, 1);
+
+}
+EXPORT_SYMBOL(flush_tlb_all);
+
+void flush_tlb_mm(struct mm_struct *mm)
+{
+	if (!mm) {
+		flush_tlb_all();
+	} else {
+		unsigned long asid = (unsigned long)ASID(mm) << 48;
+		/* Make sure page table modifications are visible. */
+		dsb(ishst);
+		/* IPI to all CPUs to do local flush. */
+		on_each_cpu_mask(mm_cpumask(mm),
+				 flush_tlb_mm_local, (void *)asid, 1);
+	}
+
+}
+EXPORT_SYMBOL(flush_tlb_mm);
+
 void flush_cache_range(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end)
 {
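One encoding detail the diff relies on but never spells out: the 16-bit ASID is placed in bits [63:48] of the TLBI operand register, which is where the ASIDE1/ASIDE1IS forms of the instruction expect it, hence the "<< 48" in both the removed broadcast path and the new IPI path. A tiny standalone illustration (the helper name is made up):

	#include <stdio.h>

	/* Illustrative helper (not from the patch): pack a 16-bit ASID into
	 * bits [63:48] of the Xt operand used by "tlbi aside1"/"aside1is". */
	static unsigned long tlbi_aside1_operand(unsigned int asid)
	{
		return (unsigned long)(asid & 0xffffU) << 48;
	}

	int main(void)
	{
		printf("ASID 0x1234 -> TLBI operand 0x%016lx\n",
		       tlbi_aside1_operand(0x1234));
		return 0;
	}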