Message ID | 20190822004644.25829-2-atish.patra@wdc.com (mailing list archive)
---|---
State | New, archived
Series | Optimize tlbflush path
On Wed, Aug 21, 2019 at 05:46:42PM -0700, Atish Patra wrote:
> In RISC-V, tlb flush happens via SBI which is expensive. If the local
> cpu is the only cpu in cpumask, there is no need to invoke a SBI call.
>
> Just do a local flush and return.
>
> Signed-off-by: Atish Patra <atish.patra@wdc.com>
> ---
>  arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index df93b26f1b9d..36430ee3bed9 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -2,6 +2,7 @@
>
>  #include <linux/mm.h>
>  #include <linux/smp.h>
> +#include <linux/sched.h>
>  #include <asm/sbi.h>
>
>  void flush_tlb_all(void)
> @@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
>  				  unsigned long size)
>  {
>  	struct cpumask hmask;
> +	unsigned int cpuid = get_cpu();
>
> +	if (!cmask) {
> +		riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
> +		goto issue_sfence;
> +	}
> +
> +	if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
> +		local_flush_tlb_all();
> +		goto done;
> +	}

I think a single core on an SMP kernel is a valid enough use case given
how few distros still ship UP kernels. So maybe this should rather be:

	if (!cmask)
		cmask = cpu_online_mask;

	if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
		local_flush_tlb_all();
	} else {
		riscv_cpuid_to_hartid_mask(cmask, &hmask);
		sbi_remote_sfence_vma(hmask.bits, start, size);
	}
On Thu, 2019-08-22 at 03:46 +0200, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 05:46:42PM -0700, Atish Patra wrote:
> > In RISC-V, tlb flush happens via SBI which is expensive. If the
> > local cpu is the only cpu in cpumask, there is no need to invoke a
> > SBI call.
> >
> > Just do a local flush and return.
> >
> > Signed-off-by: Atish Patra <atish.patra@wdc.com>
> > ---
> >  arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> > index df93b26f1b9d..36430ee3bed9 100644
> > --- a/arch/riscv/mm/tlbflush.c
> > +++ b/arch/riscv/mm/tlbflush.c
> > @@ -2,6 +2,7 @@
> >
> >  #include <linux/mm.h>
> >  #include <linux/smp.h>
> > +#include <linux/sched.h>
> >  #include <asm/sbi.h>
> >
> >  void flush_tlb_all(void)
> > @@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
> >  				  unsigned long size)
> >  {
> >  	struct cpumask hmask;
> > +	unsigned int cpuid = get_cpu();
> >
> > +	if (!cmask) {
> > +		riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
> > +		goto issue_sfence;
> > +	}
> > +
> > +	if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
> > +		local_flush_tlb_all();
> > +		goto done;
> > +	}
>
> I think a single core on an SMP kernel is a valid enough use case
> given how few distros still ship UP kernels. So maybe this should
> rather be:
>
> 	if (!cmask)
> 		cmask = cpu_online_mask;
>
> 	if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
> 		local_flush_tlb_all();
> 	} else {
> 		riscv_cpuid_to_hartid_mask(cmask, &hmask);
> 		sbi_remote_sfence_vma(hmask.bits, start, size);
> 	}

The downside of this is that for every !cmask case in true SMP (probably
the more common configuration) it will execute two extra cpumask
instructions. As the tlbflush path is performance critical, I think we
should favor the more common case (SMP with more than one core).
Thoughts?
On Thu, Aug 22, 2019 at 04:01:24AM +0000, Atish Patra wrote:
> The downside of this is that for every !cmask case in true SMP
> (probably the more common configuration) it will execute two extra
> cpumask instructions. As the tlbflush path is performance critical, I
> think we should favor the more common case (SMP with more than one
> core).

Actually, looking at both the current mainline code and the code from my
cleanups tree, I don't think remote_sfence_vma / __sbi_tlb_flush_range
can ever be called with a NULL cpumask, as we always have a valid mm.

So this is a bit of a moot point, and we can drop handling that case
entirely. With that we can also use a simple if / else for the local
cpu only vs remote case.

Btw, what was the reason you didn't like using cpumask_any_but like x86,
which should be more efficient than cpumask_test_cpu + hweight?
On Thu, 2019-08-22 at 06:27 +0200, hch@lst.de wrote:
> On Thu, Aug 22, 2019 at 04:01:24AM +0000, Atish Patra wrote:
> > The downside of this is that for every !cmask case in true SMP
> > (probably the more common configuration) it will execute two extra
> > cpumask instructions. As the tlbflush path is performance critical,
> > I think we should favor the more common case (SMP with more than
> > one core).
>
> Actually, looking at both the current mainline code and the code from
> my cleanups tree, I don't think remote_sfence_vma /
> __sbi_tlb_flush_range can ever be called with a NULL cpumask, as we
> always have a valid mm.

Yes, you are correct. As both cpumask functions here will crash if the
cpumask is NULL, we should probably leave a harmless comment to warn
about the consequence of the cpumask being NULL.

> So this is a bit of a moot point, and we can drop handling that case
> entirely. With that we can also use a simple if / else for the local
> cpu only vs remote case.

Done.

> Btw, what was the reason you didn't like using cpumask_any_but like
> x86, which should be more efficient than cpumask_test_cpu + hweight?

I had it in the v2 patch but removed it as it can potentially return a
garbage value if the cpumask is empty. However, we are already checking
for an empty cpumask before the local cpu check, so I will replace
cpumask_test_cpu + hweight with cpumask_any_but().
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index df93b26f1b9d..36430ee3bed9 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -2,6 +2,7 @@

 #include <linux/mm.h>
 #include <linux/smp.h>
+#include <linux/sched.h>
 #include <asm/sbi.h>

 void flush_tlb_all(void)
@@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
 				  unsigned long size)
 {
 	struct cpumask hmask;
+	unsigned int cpuid = get_cpu();

+	if (!cmask) {
+		riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
+		goto issue_sfence;
+	}
+
+	if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
+		local_flush_tlb_all();
+		goto done;
+	}
 	riscv_cpuid_to_hartid_mask(cmask, &hmask);
+
+issue_sfence:
 	sbi_remote_sfence_vma(hmask.bits, start, size);
+done:
+	put_cpu();
 }

 void flush_tlb_mm(struct mm_struct *mm)
In RISC-V, tlb flush happens via SBI which is expensive. If the local
cpu is the only cpu in cpumask, there is no need to invoke a SBI call.

Just do a local flush and return.

Signed-off-by: Atish Patra <atish.patra@wdc.com>
---
 arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)