diff mbox

[v1] ARM: keep __my_cpu_offset consistent with generic one

Message ID 1362663356-21151-1-git-send-email-tom.leiming@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ming Lei March 7, 2013, 1:35 p.m. UTC
Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access)
introduces arm's __my_cpu_offset to optimize percpu vaiable access,
which really works well on hackbench, but will cause __my_cpu_offset
to return garbage value before it is initialized in cpu_init() called
by setup_arch, so accessing percpu variable before setup_arch may cause
kernel hang. But generic __my_cpu_offset always returns zero before
percpu area is brought up, and won't hang kernel.

So the patch tries to clear __my_cpu_offset on boot CPU early
to avoid boot hang.

At least now percpu variable is accessed by lockdep before
setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP
can trigger kernel hang.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
V1:
	- documents lockdep uses percpu variable early

 arch/arm/kernel/setup.c |    7 +++++++
 1 file changed, 7 insertions(+)

Comments

Ming Lei March 12, 2013, 2:32 a.m. UTC | #1
On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access)
> introduces arm's __my_cpu_offset to optimize percpu vaiable access,
> which really works well on hackbench, but will cause __my_cpu_offset
> to return garbage value before it is initialized in cpu_init() called
> by setup_arch, so accessing percpu variable before setup_arch may cause
> kernel hang. But generic __my_cpu_offset always returns zero before
> percpu area is brought up, and won't hang kernel.
>
> So the patch tries to clear __my_cpu_offset on boot CPU early
> to avoid boot hang.
>
> At least now percpu variable is accessed by lockdep before
> setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP
> can trigger kernel hang.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Nicolas Pitre <nico@linaro.org>
> Cc: Russell King <rmk+kernel@arm.linux.org.uk>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
> V1:
>         - documents lockdep uses percpu variable early

Looks no one objects the patch, so I has submitted it into Russell's
patch system, and hope it can be pushed to linus tree soon and
make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7.


Thanks,
Russell King - ARM Linux March 12, 2013, 10:56 a.m. UTC | #2
On Tue, Mar 12, 2013 at 10:32:21AM +0800, Ming Lei wrote:
> On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> > Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access)
> > introduces arm's __my_cpu_offset to optimize percpu vaiable access,
> > which really works well on hackbench, but will cause __my_cpu_offset
> > to return garbage value before it is initialized in cpu_init() called
> > by setup_arch, so accessing percpu variable before setup_arch may cause
> > kernel hang. But generic __my_cpu_offset always returns zero before
> > percpu area is brought up, and won't hang kernel.
> >
> > So the patch tries to clear __my_cpu_offset on boot CPU early
> > to avoid boot hang.
> >
> > At least now percpu variable is accessed by lockdep before
> > setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP
> > can trigger kernel hang.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Rob Herring <rob.herring@calxeda.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Nicolas Pitre <nico@linaro.org>
> > Cc: Russell King <rmk+kernel@arm.linux.org.uk>
> > Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> > ---
> > V1:
> >         - documents lockdep uses percpu variable early
> 
> Looks no one objects the patch, so I has submitted it into Russell's
> patch system, and hope it can be pushed to linus tree soon and
> make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7.

I'm not convinced it is correct.  Is the percpu data as stored in the
kernel image (in other words, at offset zero) supposed to be read only?
If so, the above means that we'll be accessing that rather than the
copy of the percpu data we should be accessing.

The percpu data areas are allocated by setup_per_cpu_areas() - that's
where we should be initializing this, just like it's done on PowerPC.
Ming Lei March 12, 2013, 11:25 a.m. UTC | #3
On Tue, Mar 12, 2013 at 6:56 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
>>
>> Looks no one objects the patch, so I has submitted it into Russell's
>> patch system, and hope it can be pushed to linus tree soon and
>> make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7.
>
> I'm not convinced it is correct.  Is the percpu data as stored in the
> kernel image (in other words, at offset zero) supposed to be read only?

It should have been used after setup_per_cpu_areas().

> If so, the above means that we'll be accessing that rather than the
> copy of the percpu data we should be accessing.

I admit the patch is a work around for the problem, but it is harmless
to make lockdep workable on arm at least.

> The percpu data areas are allocated by setup_per_cpu_areas() - that's
> where we should be initializing this, just like it's done on PowerPC.

From the entry of start_kernel to setup_per_cpu_areas, there are many
locks which will be acquired/released, so the percpu variable in lock_release
has to be used early now.  Either disabling lockdep during the period or
introducing stupid/simple percpu variable inside lockdep may fix the probem,
but looks both aren't perfect.

So the workaround is proposed in this patch...

Ingo and Peter, what is your opinion on the problem?

Thanks
Russell King - ARM Linux March 12, 2013, 11:30 a.m. UTC | #4
On Tue, Mar 12, 2013 at 07:25:28PM +0800, Ming Lei wrote:
> On Tue, Mar 12, 2013 at 6:56 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> >>
> >> Looks no one objects the patch, so I has submitted it into Russell's
> >> patch system, and hope it can be pushed to linus tree soon and
> >> make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7.
> >
> > I'm not convinced it is correct.  Is the percpu data as stored in the
> > kernel image (in other words, at offset zero) supposed to be read only?
> 
> It should have been used after setup_per_cpu_areas().
> 
> > If so, the above means that we'll be accessing that rather than the
> > copy of the percpu data we should be accessing.
> 
> I admit the patch is a work around for the problem, but it is harmless
> to make lockdep workable on arm at least.
> 
> > The percpu data areas are allocated by setup_per_cpu_areas() - that's
> > where we should be initializing this, just like it's done on PowerPC.
> 
> >From the entry of start_kernel to setup_per_cpu_areas, there are many
> locks which will be acquired/released, so the percpu variable in lock_release
> has to be used early now.  Either disabling lockdep during the period or
> introducing stupid/simple percpu variable inside lockdep may fix the probem,
> but looks both aren't perfect.
> 
> So the workaround is proposed in this patch...
> 
> Ingo and Peter, what is your opinion on the problem?

Having discussed this with Ben Herrenschmidt, it seems that we do need
to have a more complex patch to sort this out - we need to setup our
private pointer inside setup_per_cpu_areas().
Ming Lei March 12, 2013, 11:44 a.m. UTC | #5
On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
>>
>> Ingo and Peter, what is your opinion on the problem?
>
> Having discussed this with Ben Herrenschmidt, it seems that we do need
> to have a more complex patch to sort this out - we need to setup our
> private pointer inside setup_per_cpu_areas().

Suppose so, seems the patch is still needed to make CPU0 see
static variables in '.data..percpu' section correctly.


Thanks,
Tejun Heo March 12, 2013, 5:25 p.m. UTC | #6
Hello,

On Tue, Mar 12, 2013 at 07:44:55PM +0800, Ming Lei wrote:
> On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> >>
> >> Ingo and Peter, what is your opinion on the problem?
> >
> > Having discussed this with Ben Herrenschmidt, it seems that we do need
> > to have a more complex patch to sort this out - we need to setup our
> > private pointer inside setup_per_cpu_areas().
> 
> Suppose so, seems the patch is still needed to make CPU0 see
> static variables in '.data..percpu' section correctly.

If my memory serves me right, x86 also has places where CPU0 accesses
its per-cpu data in .data.percpu.  While those existed (not sure
they're still there but probably they're) mostly due to historical
reasons rather than by design, as long as the data is in consistent
state by and during percpu setup, nothing will break.

Thanks.
Ming Lei March 13, 2013, 12:57 a.m. UTC | #7
On Wed, Mar 13, 2013 at 1:25 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Mar 12, 2013 at 07:44:55PM +0800, Ming Lei wrote:
>> On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> >>
>> >> Ingo and Peter, what is your opinion on the problem?
>> >
>> > Having discussed this with Ben Herrenschmidt, it seems that we do need
>> > to have a more complex patch to sort this out - we need to setup our
>> > private pointer inside setup_per_cpu_areas().
>>
>> Suppose so, seems the patch is still needed to make CPU0 see
>> static variables in '.data..percpu' section correctly.
>
> If my memory serves me right, x86 also has places where CPU0 accesses
> its per-cpu data in .data.percpu.  While those existed (not sure
> they're still there but probably they're) mostly due to historical
> reasons rather than by design, as long as the data is in consistent
> state by and during percpu setup, nothing will break.

Tejun, thanks for your input, yes, nothing will break.  For lockdep,
the percpu variables in non-boot-CPUs may be initialized again after
percpu area is set up.


Thanks,
Russell King - ARM Linux April 3, 2013, 4:19 p.m. UTC | #8
On Tue, Mar 12, 2013 at 10:56:38AM +0000, Russell King - ARM Linux wrote:
> On Tue, Mar 12, 2013 at 10:32:21AM +0800, Ming Lei wrote:
> > On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> > > Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access)
> > > introduces arm's __my_cpu_offset to optimize percpu vaiable access,
> > > which really works well on hackbench, but will cause __my_cpu_offset
> > > to return garbage value before it is initialized in cpu_init() called
> > > by setup_arch, so accessing percpu variable before setup_arch may cause
> > > kernel hang. But generic __my_cpu_offset always returns zero before
> > > percpu area is brought up, and won't hang kernel.
> > >
> > > So the patch tries to clear __my_cpu_offset on boot CPU early
> > > to avoid boot hang.
> > >
> > > At least now percpu variable is accessed by lockdep before
> > > setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP
> > > can trigger kernel hang.
> > >
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Rob Herring <rob.herring@calxeda.com>
> > > Cc: Will Deacon <will.deacon@arm.com>
> > > Cc: Nicolas Pitre <nico@linaro.org>
> > > Cc: Russell King <rmk+kernel@arm.linux.org.uk>
> > > Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> > > ---
> > > V1:
> > >         - documents lockdep uses percpu variable early
> > 
> > Looks no one objects the patch, so I has submitted it into Russell's
> > patch system, and hope it can be pushed to linus tree soon and
> > make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7.
> 
> I'm not convinced it is correct.  Is the percpu data as stored in the
> kernel image (in other words, at offset zero) supposed to be read only?
> If so, the above means that we'll be accessing that rather than the
> copy of the percpu data we should be accessing.
> 
> The percpu data areas are allocated by setup_per_cpu_areas() - that's
> where we should be initializing this, just like it's done on PowerPC.

Still not convinced this is a proper fix.  Look, the problem is this:

- Initially, set the CPU percpu offset to zero.  This means the boot
  CPU reads and writes to the percpu data section in the kernel image.

- The percpu areas are initialized, and the percpu data copied to each
  percpu data - this will have any writes from the boot CPU included as
  changes to the percpu data.

- The boot CPU continues to read/write to the percpu data section.

- If the boot CPU suspends/resumes, cpu_init() gets called, which will
  call set_my_cpu_offset(per_cpu_offset(cpu)); for the boot CPU.

- The boot CPU now uses the allocated percpu data section and any
  updates it did in the percpu data section in the kernel image are
  lost to it.

Your patch may be right on its own to solve the initial problem, but
it leaves a _big_ hole.

Now, the big question here: is it right that the boot CPU should ever
write to the static percpu data section in the kernel image?  What if
there's a pointer in there, initially NULL, which then gets checked
by each CPU and initialized if NULL - we'll end up sharing the same
allocation amongst all CPUs, which probably isn't what was intended.
If there's a list_head which gets added to, that too will be very bad.

Although you have uncovered a problem, I still think by setting the
offset to zero initially, you're just papering over a much bigger
can of worms.

So, should percpu data be used this early in boot before the percpu
stuff is properly initialized?  That feels _extremely_ unsafe.

This, I think, needs to be addressed properly.  And part of that is
knowing where things went wrong.  Will Deacon asked you for a backtrace
showing where this problem occured.  Your response seems to be to
resend the patch with a "v1" tag a no new information.

Sorry, not applying this until the above issue has been discussed
and the location of these percpu accesses has been properly analysed.
Ming Lei April 4, 2013, 6:31 a.m. UTC | #9
Hi,

On Thu, Apr 4, 2013 at 12:19 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
>
> Still not convinced this is a proper fix.  Look, the problem is this:
>
> - Initially, set the CPU percpu offset to zero.  This means the boot
>   CPU reads and writes to the percpu data section in the kernel image.
>
> - The percpu areas are initialized, and the percpu data copied to each
>   percpu data - this will have any writes from the boot CPU included as
>   changes to the percpu data.
>
> - The boot CPU continues to read/write to the percpu data section.

No, the boot CPU switches to access percpu area after the area is
created(after setup_per_cpu_areas() and smp_prepare_boot_cpu()).

>
> - If the boot CPU suspends/resumes, cpu_init() gets called, which will
>   call set_my_cpu_offset(per_cpu_offset(cpu)); for the boot CPU.
>
> - The boot CPU now uses the allocated percpu data section and any
>   updates it did in the percpu data section in the kernel image are
>   lost to it.
>
> Your patch may be right on its own to solve the initial problem, but
> it leaves a _big_ hole.

Without the patch, looks the hols was still here, at least before commit
14318efb(ARM: 7587/1: implement optimized percpu variable access),
the per_cpu_offset() always return 0 before percpu area is created on ARM.
Same on other ARCHs too.

>
> Now, the big question here: is it right that the boot CPU should ever
> write to the static percpu data section in the kernel image?  What if

IMO, it isn't right, but as Tejun said, "it mostly due to historical
reasons rather than by design, as long as the data is in consistent
state by and during percpu setup, nothing will break."

As far as the lockdep case, it is a bit special since lock can
be used very early, and surely more early than creating percpu
area.

> there's a pointer in there, initially NULL, which then gets checked
> by each CPU and initialized if NULL - we'll end up sharing the same
> allocation amongst all CPUs, which probably isn't what was intended.
> If there's a list_head which gets added to, that too will be very bad.

Once lockstat is disabled, arm can boot well, that means no others might
access percpu data section early, so could we just treat it as a special case?

>
> Although you have uncovered a problem, I still think by setting the
> offset to zero initially, you're just papering over a much bigger
> can of worms.
>
> So, should percpu data be used this early in boot before the percpu
> stuff is properly initialized?  That feels _extremely_ unsafe.
>
> This, I think, needs to be addressed properly.  And part of that is
> knowing where things went wrong.  Will Deacon asked you for a backtrace
> showing where this problem occured.  Your response seems to be to
> resend the patch with a "v1" tag a no new information.

I have explained to Will Deacon, and looks no oops trace is generated
on the memory access exception during booting. And the hang
happened in mutex_unlock(&cgroup_mutex)(cgroup_init_subsys<-
cgroup_init_early).

> Sorry, not applying this until the above issue has been discussed
> and the location of these percpu accesses has been properly analysed.

No problem.

Thanks,
diff mbox

Patch

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 3f6cbb2..1a9dc21 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -439,6 +439,13 @@  void __init smp_setup_processor_id(void)
 	for (i = 1; i < nr_cpu_ids; ++i)
 		cpu_logical_map(i) = i == cpu ? 0 : i;
 
+	/*
+	 * clear __my_cpu_offset on boot CPU to avoid hang caused by
+	 * using percpu variable early, for example, lockdep will
+	 * access percpu variable inside lock_release
+	 */
+	set_my_cpu_offset(0);
+
 	printk(KERN_INFO "Booting Linux on physical CPU 0x%x\n", mpidr);
 }