| Message ID | 1354200764-23751-1-git-send-email-robherring2@gmail.com (mailing list archive) |
| --- | --- |
| State | New, archived |
Hi Rob,

Thanks for the v2. One comment inline...

On Thu, Nov 29, 2012 at 02:52:44PM +0000, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
> This replaces 2 loads with a mrc instruction for each percpu variable
> access. With hackbench, the performance improvement is 1.4% on Cortex-A9
> (highbank). Taking an average of 30 runs of "hackbench -l 1000" yields:
>
> Before: 6.2191
> After: 6.1348
>
> Will Deacon reported similar delta on v6 with 11MPCore.
>
> The asm "memory" constraints are needed here to ensure the percpu offset
> gets reloaded. Testing by Will found that this would not happen in
> __schedule() which is a bit of a special case as preemption is disabled
> but the execution can move cores.
>
> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> Acked-by: Will Deacon <will.deacon@arm.com>
> ---
> Changes in v2:
> - Add asm "memory" constraint
> - Only enable on v6K and v7 and avoid enabling for v6 SMP_ON_UP
> - Fix missing initialization of TPIDRPRW for resume path
> - Move cpu_init to beginning of secondary_start_kernel to ensure percpu
>   variables can be accessed as early as possible.

[...]

> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index fbc8b26..aadcca7 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -296,6 +296,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
>  	struct mm_struct *mm = &init_mm;
>  	unsigned int cpu;
>
> +	cpu_init();
> +
>  	/*
>  	 * The identity mapping is uncached (strongly ordered), so
>  	 * switch away from it before attempting any exclusive accesses.
> @@ -315,7 +317,6 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
>
>  	printk("CPU%u: Booted secondary processor\n", cpu);
>
> -	cpu_init();
>  	preempt_disable();
>  	trace_hardirqs_off();

It's really not safe moving the cpu_init that early because we're running
strongly ordered at that point, so locks aren't working.

Will
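(For context on why "locks aren't working" there: ARM spinlocks are built on the exclusive-access instructions ldrex/strex, and the architecture does not guarantee that the exclusive monitor works on Strongly-Ordered memory, so a lock taken while still running on the uncached identity mapping may never be acquired. Below is a minimal sketch of the locking pattern, modelled loosely on the arch/arm/include/asm/spinlock.h of this era rather than quoting it exactly — the real code adds WFE/SEV hints among other details.)

```c
typedef struct {
	volatile unsigned int lock;
} arch_spinlock_t;

/* Full barrier; "dmb" is the ARMv7 encoding (v6 used a CP15 operation). */
static inline void smp_mb(void)
{
	asm volatile("dmb" : : : "memory");
}

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
	unsigned long tmp;

	/*
	 * ldrex/strex need the lock word mapped as Normal memory; on a
	 * Strongly-Ordered mapping the exclusive monitor's behaviour is
	 * not architecturally guaranteed, so strex may fail forever.
	 */
	__asm__ __volatile__(
"1:	ldrex	%0, [%1]\n"	/* load-exclusive the lock value	*/
"	teq	%0, #0\n"	/* already held?			*/
"	strexeq	%0, %2, [%1]\n"	/* free: try to claim it		*/
"	teqeq	%0, #0\n"	/* did the store-exclusive succeed?	*/
"	bne	1b"		/* held, or strex failed: retry		*/
	: "=&r" (tmp)
	: "r" (&lock->lock), "r" (1)
	: "cc");

	smp_mb();		/* acquire barrier */
}
```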
On Thu, Nov 29, 2012 at 03:05:09PM +0000, Will Deacon wrote:
> It's really not safe moving the cpu_init that early because we're running
> strongly ordered at that point, so locks aren't working.

Maybe just move it before the printk?
On 11/29/2012 09:05 AM, Will Deacon wrote:
> Hi Rob,
>
> Thanks for the v2. One comment inline...
>
> On Thu, Nov 29, 2012 at 02:52:44PM +0000, Rob Herring wrote:
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> Use the previously unused TPIDRPRW register to store percpu offsets.
>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>
[...]
>
>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
>> index fbc8b26..aadcca7 100644
>> --- a/arch/arm/kernel/smp.c
>> +++ b/arch/arm/kernel/smp.c
>> @@ -296,6 +296,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
>>  	struct mm_struct *mm = &init_mm;
>>  	unsigned int cpu;
>>
>> +	cpu_init();
>> +
>>  	/*
>>  	 * The identity mapping is uncached (strongly ordered), so
>>  	 * switch away from it before attempting any exclusive accesses.
>> @@ -315,7 +317,6 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
>>
>>  	printk("CPU%u: Booted secondary processor\n", cpu);
>>
>> -	cpu_init();
>>  	preempt_disable();
>>  	trace_hardirqs_off();
>
> It's really not safe moving the cpu_init that early because we're running
> strongly ordered at that point, so locks aren't working.

I'll drop this hunk. Nicolas had concerns, but I've checked all the
functions prior to cpu_init and don't see any percpu variable accesses.
It could be an issue if we made current be a percpu var like x86-64, but
I don't see any advantage to doing that.

Rob
On Thu, Nov 29, 2012 at 03:13:29PM +0000, Rob Herring wrote:
> On 11/29/2012 09:05 AM, Will Deacon wrote:
> > On Thu, Nov 29, 2012 at 02:52:44PM +0000, Rob Herring wrote:
> >> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> >> index fbc8b26..aadcca7 100644
> >> --- a/arch/arm/kernel/smp.c
> >> +++ b/arch/arm/kernel/smp.c
[...]
> > It's really not safe moving the cpu_init that early because we're running
> > strongly ordered at that point, so locks aren't working.
>
> I'll drop this hunk. Nicolas had concerns, but I've checked all the
> functions prior to cpu_init and don't see any percpu variable accesses.
> It could be an issue if we made current be a percpu var like x86-64, but
> I don't see any advantage to doing that.

I dunno, moving it before printk as Russell suggested is probably a good
idea as that code certainly isn't trivial and could easily tickle per-cpu
accesses in certain configurations.

Will
On Thu, 29 Nov 2012, Rob Herring wrote:

> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
> This replaces 2 loads with a mrc instruction for each percpu variable
> access. With hackbench, the performance improvement is 1.4% on Cortex-A9
> (highbank). Taking an average of 30 runs of "hackbench -l 1000" yields:
>
> Before: 6.2191
> After: 6.1348
>
> Will Deacon reported similar delta on v6 with 11MPCore.
>
> The asm "memory" constraints are needed here to ensure the percpu offset
> gets reloaded. Testing by Will found that this would not happen in
> __schedule() which is a bit of a special case as preemption is disabled
> but the execution can move cores.

Strictly speaking, you should say "memory clobber" and not "memory
constraint" in this case.

> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> Acked-by: Will Deacon <will.deacon@arm.com>

With the above, and moving the call to cpu_init() after the call to
cpu_switch_mm(mm->pgd, mm) to fix Will's concerns (personally I'd put it
right after local_flush_tlb_all())...

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
> Changes in v2:
> - Add asm "memory" constraint
> - Only enable on v6K and v7 and avoid enabling for v6 SMP_ON_UP
> - Fix missing initialization of TPIDRPRW for resume path
> - Move cpu_init to beginning of secondary_start_kernel to ensure percpu
>   variables can be accessed as early as possible.

[...]
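(Nicolas's terminology nit is worth unpacking: in GCC inline asm, a *constraint* such as "r" describes an operand, while a *clobber* lists state the asm may touch. The "memory" clobber is what forces the reload the commit message describes: without it, GCC may treat the asm as a pure function of its inputs and merge identical instances, so an offset read before a context switch could be reused after the task has migrated. A hypothetical side-by-side — the unsafe variant is invented for illustration; only the clobbered form appears in the patch:)

```c
/*
 * UNSAFE variant (illustration only): no "memory" clobber, so GCC may
 * CSE/hoist the read.  Two uses separated by a context-switch point
 * (e.g. inside __schedule(), where preemption is already disabled but
 * the task can still change CPUs) could share one stale mrc result.
 */
static inline unsigned long my_cpu_offset_unsafe(void)
{
	unsigned long off;
	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off));
	return off;
}

/*
 * Safe variant, as in the patch: the "memory" clobber tells GCC the asm
 * depends on (and may alter) memory, so the result cannot be cached
 * across other memory accesses - each use re-reads TPIDRPRW.
 */
static inline unsigned long my_cpu_offset(void)
{
	unsigned long off;
	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
	return off;
}
```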
On Thu, 29 Nov 2012, Rob Herring wrote:

> On 11/29/2012 09:05 AM, Will Deacon wrote:
> > Hi Rob,
> >
> > Thanks for the v2. One comment inline...
> >
[...]
> > It's really not safe moving the cpu_init that early because we're running
> > strongly ordered at that point, so locks aren't working.
>
> I'll drop this hunk. Nicolas had concerns, but I've checked all the
> functions prior to cpu_init and don't see any percpu variable accesses.

Be defensive in your code, my friend. Just because there are no such
accesses today doesn't mean there never will be.

> It could be an issue if we made current be a percpu var like x86-64, but
> I don't see any advantage to doing that.

Maybe someone else will have another use for it. You never know.

Nicolas
On Thu, Nov 29, 2012 at 10:45:50AM -0500, Nicolas Pitre wrote:
> On Thu, 29 Nov 2012, Rob Herring wrote:
> > Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > Acked-by: Will Deacon <will.deacon@arm.com>
>
> With the above, and moving the call to cpu_init() after the call to
> cpu_switch_mm(mm->pgd, mm) to fix Will's concerns (personally I'd put it
> right after local_flush_tlb_all())...

You're confused. We were suggesting before the printk().

The reasoning is: printk() is not guaranteed not to access per-cpu
variables, so cpu_init() needs to come before the first printk(). It
can't be before cpu_switch_mm(), and putting it before the TLB flush
does _not_ guarantee that the TLB/MMU isn't still seeing the
strongly-ordered attribute - so it _must_ be after the TLB flush.

As for the setup of the active MM, that's something that I still
think should come as early as possible (even before the per-cpu
stuff) because that is getting everything properly initialized and
set up for this new thread; there's a risk that a fault occurring
before that point may cause issues, especially if active_mm were NULL.

So, between cpumask_set_cpu() and printk() please.
On Thu, 29 Nov 2012, Russell King - ARM Linux wrote:

> On Thu, Nov 29, 2012 at 10:45:50AM -0500, Nicolas Pitre wrote:
> > On Thu, 29 Nov 2012, Rob Herring wrote:
> > > Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > > Acked-by: Will Deacon <will.deacon@arm.com>
> >
> > With the above, and moving the call to cpu_init() after the call to
> > cpu_switch_mm(mm->pgd, mm) to fix Will's concerns (personally I'd put it
> > right after local_flush_tlb_all())...
>
> You're confused. We were suggesting before the printk().
>
> The reasoning is: printk() is not guaranteed not to access per-cpu
> variables, so cpu_init() needs to come before the first printk(). It
> can't be before cpu_switch_mm(), and putting it before the TLB flush
> does _not_ guarantee that the TLB/MMU isn't still seeing the
> strongly-ordered attribute - so it _must_ be after the TLB flush.

But isn't that what I wrote above? I said "personally I'd put it right
after local_flush_tlb_all()".

> As for the setup of the active MM, that's something that I still
> think should come as early as possible (even before the per-cpu
> stuff) because that is getting everything properly initialized and
> set up for this new thread; there's a risk that a fault occurring
> before that point may cause issues, especially if active_mm were NULL.

Good point.

> So, between cpumask_set_cpu() and printk() please.

Agreed.

Nicolas
On Thu, Nov 29, 2012 at 11:02:39AM -0500, Nicolas Pitre wrote:
> On Thu, 29 Nov 2012, Russell King - ARM Linux wrote:
> > On Thu, Nov 29, 2012 at 10:45:50AM -0500, Nicolas Pitre wrote:
> > > On Thu, 29 Nov 2012, Rob Herring wrote:
> > > > Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > > > Acked-by: Will Deacon <will.deacon@arm.com>
> > >
> > > With the above, and moving the call to cpu_init() after the call to
> > > cpu_switch_mm(mm->pgd, mm) to fix Will's concerns (personally I'd put it
> > > right after local_flush_tlb_all())...
> >
> > You're confused. We were suggesting before the printk().
> >
> > The reasoning is: printk() is not guaranteed not to access per-cpu
> > variables, so cpu_init() needs to come before the first printk(). It
> > can't be before cpu_switch_mm(), and putting it before the TLB flush
> > does _not_ guarantee that the TLB/MMU isn't still seeing the
> > strongly-ordered attribute - so it _must_ be after the TLB flush.
>
> But isn't that what I wrote above? I said "personally I'd put it right
> after local_flush_tlb_all()".

No it isn't. No one was suggesting putting it after cpu_switch_mm() in
this thread other than your statement above. _That_ is what provoked me
into replying, because it was wrong and needed to be corrected before we
got a patch doing exactly that.
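(Pulling the thread's conclusion together, the placement everyone converges on looks roughly like the sketch below — an abridged secondary_start_kernel() with the surrounding lines paraphrased from the arch/arm/kernel/smp.c of this era, not quoted exactly:)

```c
asmlinkage void __cpuinit secondary_start_kernel(void)
{
	struct mm_struct *mm = &init_mm;
	unsigned int cpu;

	/*
	 * Leave the strongly-ordered identity mapping before any
	 * exclusive accesses (and hence any locking), then flush the
	 * TLB so no strongly-ordered entries linger.
	 */
	cpu_switch_mm(mm->pgd, mm);
	enter_lazy_tlb(mm, current);
	local_flush_tlb_all();

	/*
	 * Grab the shared kernel mm early so a stray fault can never
	 * observe active_mm == NULL (Russell's point about the MM setup
	 * coming even before the per-cpu setup).
	 */
	cpu = smp_processor_id();
	atomic_inc(&mm->mm_count);
	current->active_mm = mm;
	cpumask_set_cpu(cpu, mm_cpumask(mm));

	cpu_init();	/* sets TPIDRPRW: after the TLB flush, before the
			   first printk(), which may touch per-cpu data */

	printk("CPU%u: Booted secondary processor\n", cpu);
	/* ... */
}
```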
* Rob Herring <robherring2@gmail.com> [121129 06:55]:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
[...]
>
> Changes in v2:
> - Add asm "memory" constraint
> - Only enable on v6K and v7 and avoid enabling for v6 SMP_ON_UP

Thanks, seems to still boot on omap2 with omap2plus_defconfig.
Once the other comments are sorted out:

Acked-by: Tony Lindgren <tony@atomide.com>
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index f70ae17..2ffdaac 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += local64.h
 generic-y += msgbuf.h
 generic-y += param.h
 generic-y += parport.h
-generic-y += percpu.h
 generic-y += poll.h
 generic-y += resource.h
 generic-y += sections.h
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
new file mode 100644
index 0000000..968c0a1
--- /dev/null
+++ b/arch/arm/include/asm/percpu.h
@@ -0,0 +1,45 @@
+/*
+ * Copyright 2012 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ASM_ARM_PERCPU_H_
+#define _ASM_ARM_PERCPU_H_
+
+/*
+ * Same as asm-generic/percpu.h, except that we store the per cpu offset
+ * in the TPIDRPRW. TPIDRPRW only exists on V6K and V7
+ */
+#if defined(CONFIG_SMP) && !defined(CONFIG_CPU_V6)
+static inline void set_my_cpu_offset(unsigned long off)
+{
+	/* Set TPIDRPRW */
+	asm volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (off) : "memory");
+}
+
+static inline unsigned long __my_cpu_offset(void)
+{
+	unsigned long off;
+	/* Read TPIDRPRW */
+	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
+	return off;
+}
+#define __my_cpu_offset __my_cpu_offset()
+#else
+#define set_my_cpu_offset(x)	do {} while(0)
+
+#endif /* CONFIG_SMP */
+
+#include <asm-generic/percpu.h>
+
+#endif /* _ASM_ARM_PERCPU_H_ */
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index da1d1aa..b2909ce 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -383,6 +383,12 @@ void cpu_init(void)
 		BUG();
 	}
 
+	/*
+	 * This only works on resume and secondary cores. For booting on the
+	 * boot cpu, smp_prepare_boot_cpu is called after percpu area setup.
+	 */
+	set_my_cpu_offset(per_cpu_offset(cpu));
+
 	cpu_proc_init();
 
 	/*
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index fbc8b26..aadcca7 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -296,6 +296,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
 	struct mm_struct *mm = &init_mm;
 	unsigned int cpu;
 
+	cpu_init();
+
 	/*
 	 * The identity mapping is uncached (strongly ordered), so
 	 * switch away from it before attempting any exclusive accesses.
@@ -315,7 +317,6 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
 
 	printk("CPU%u: Booted secondary processor\n", cpu);
 
-	cpu_init();
 	preempt_disable();
 	trace_hardirqs_off();
 
@@ -371,6 +372,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 void __init smp_prepare_boot_cpu(void)
 {
+	set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
 }
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
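(To see where the "2 loads" versus "a mrc" saving in the commit message comes from, it helps to look at how the generic per-cpu machinery consumes this hook. Roughly, simplified from the include/asm-generic/percpu.h of this era — the exact macro layering varies between kernel versions:)

```c
/*
 * Generic SMP fallback: derive the offset from the cpu number.  On ARM
 * this cost two dependent loads per access - one to fetch
 * current_thread_info()->cpu, one to index __per_cpu_offset[].
 */
#ifndef __my_cpu_offset
#define __my_cpu_offset		per_cpu_offset(raw_smp_processor_id())
#endif

/* A per-cpu pointer is then simply "link-time address + offset": */
#define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
#define __this_cpu_ptr(ptr)	SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
```

Because arch/arm/include/asm/percpu.h defines __my_cpu_offset before including the generic header, the #ifndef fallback above never fires on v6K/v7 SMP kernels, and every accessor built on SHIFT_PERCPU_PTR compiles down to one mrc plus an add.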