diff mbox

ARM: omap2+: Revert omap-smp.c changes resetting cpu1 during boot

Message ID 20170213215013.19929-1-tony@atomide.com (mailing list archive)
State New, archived
Headers show

Commit Message

Tony Lindgren Feb. 13, 2017, 9:50 p.m. UTC
Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
on omap4 when doing kexec boot between two different kernel versions. The
booted kernel ended up trying to use the old kernel start-up address unless
cpu1 was reset before configuring the cpu1 start-up address.

It seems the reset part was not correct but probably working around some
other issue. I have not been able to reproduce this issue any longer despite
testing with backported patches back to v4.6 kernel. So it is possible this
issue was caused by other work in progress kexec patches I had applied. Or
it is possible some other fixes have made the issue go way.

The unconditional reset of cpu1 can cause issues booting some devices. For
example, bootloader configured secure OS running on cpu1 will fail as the
configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.

Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
still need to reset cpu1 in some cases, we can add it back and do it
conditionally.

Fixes: 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec")
Cc: Keerthy <j-keerthy@ti.com>
Cc: Tero Kristo <t-kristo@ti.com>
Reported-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
---
 arch/arm/mach-omap2/omap-smp.c | 10 ----------
 1 file changed, 10 deletions(-)

Comments

Tony Lindgren Feb. 14, 2017, 7:36 p.m. UTC | #1
* Tony Lindgren <tony@atomide.com> [170213 13:51]:
> Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
> resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
> on omap4 when doing kexec boot between two different kernel versions. The
> booted kernel ended up trying to use the old kernel start-up address unless
> cpu1 was reset before configuring the cpu1 start-up address.
> 
> It seems the reset part was not correct but probably working around some
> other issue. I have not been able to reproduce this issue any longer despite
> testing with backported patches back to v4.6 kernel. So it is possible this
> issue was caused by other work in progress kexec patches I had applied. Or
> it is possible some other fixes have made the issue go way.
> 
> The unconditional reset of cpu1 can cause issues booting some devices. For
> example, bootloader configured secure OS running on cpu1 will fail as the
> configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.
> 
> Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
> still need to reset cpu1 in some cases, we can add it back and do it
> conditionally.

Actually with this I'm now seeing cpu1 not come up after a suspend/resume
cycle on duovero:

[  118.257415] CPU1: shutdown
[  118.294616] Error taking CPU1 up: -2
[  118.299072] PM: noirq resume of devices complete after 3.723 msecs
[  118.303802] PM: early resume of devices complete after 3.723 msecs

So this issue needs to be investigated more.

Regards,

Tony

> Fixes: 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec")
> Cc: Keerthy <j-keerthy@ti.com>
> Cc: Tero Kristo <t-kristo@ti.com>
> Reported-by: Andrew F. Davis <afd@ti.com>
> Signed-off-by: Tony Lindgren <tony@atomide.com>
> ---
>  arch/arm/mach-omap2/omap-smp.c | 10 ----------
>  1 file changed, 10 deletions(-)
> 
> diff --git a/arch/arm/mach-omap2/omap-smp.c b/arch/arm/mach-omap2/omap-smp.c
> --- a/arch/arm/mach-omap2/omap-smp.c
> +++ b/arch/arm/mach-omap2/omap-smp.c
> @@ -300,16 +300,6 @@ static void __init omap4_smp_prepare_cpus(unsigned int max_cpus)
>  		scu_enable(cfg.scu_base);
>  
>  	/*
> -	 * Reset CPU1 before configuring, otherwise kexec will
> -	 * end up trying to use old kernel startup address.
> -	 */
> -	if (cfg.cpu1_rstctrl_va) {
> -		writel_relaxed(1, cfg.cpu1_rstctrl_va);
> -		readl_relaxed(cfg.cpu1_rstctrl_va);
> -		writel_relaxed(0, cfg.cpu1_rstctrl_va);
> -	}
> -
> -	/*
>  	 * Write the address of secondary startup routine into the
>  	 * AuxCoreBoot1 where ROM code will jump and start executing
>  	 * on secondary core once out of WFE
> -- 
> 2.11.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Tony Lindgren Feb. 15, 2017, 6:39 p.m. UTC | #2
* Tony Lindgren <tony@atomide.com> [170214 11:39]:
> * Tony Lindgren <tony@atomide.com> [170213 13:51]:
> > Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
> > resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
> > on omap4 when doing kexec boot between two different kernel versions. The
> > booted kernel ended up trying to use the old kernel start-up address unless
> > cpu1 was reset before configuring the cpu1 start-up address.
> > 
> > It seems the reset part was not correct but probably working around some
> > other issue. I have not been able to reproduce this issue any longer despite
> > testing with backported patches back to v4.6 kernel. So it is possible this
> > issue was caused by other work in progress kexec patches I had applied. Or
> > it is possible some other fixes have made the issue go way.
> > 
> > The unconditional reset of cpu1 can cause issues booting some devices. For
> > example, bootloader configured secure OS running on cpu1 will fail as the
> > configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.
> > 
> > Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
> > still need to reset cpu1 in some cases, we can add it back and do it
> > conditionally.
> 
> Actually with this I'm now seeing cpu1 not come up after a suspend/resume
> cycle on duovero:
> 
> [  118.257415] CPU1: shutdown
> [  118.294616] Error taking CPU1 up: -2
> [  118.299072] PM: noirq resume of devices complete after 3.723 msecs
> [  118.303802] PM: early resume of devices complete after 3.723 msecs
> 
> So this issue needs to be investigated more.

And then today the omap4 suspend/resume issue is no longer reproducable..
Go figure.

But then doing more testing I noticed that also omap5 needs the reset.
Without it we get the following on omap5-uevm doing a kexec boot. So clearly
the reset cannot be just removed at least for omap4 and omap5.

Regards,

Tony

8< ---------------------
[    0.156796] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.163396] Setting up static identity map for 0x80100000 - 0x80100070
[    0.172246] smp: Bringing up secondary CPUs ...
[    0.178970] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[    0.178974] pgd = c0004000
[    0.178977] [00000000] *pgd=00000000
[    0.178990] Internal error: Oops: 80000005 [#1] SMP ARM
[    0.178995] Modules linked in:
[    0.179005] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.0-rc8-next-20170215+ #120
[    0.179008] Hardware name: Generic OMAP5 (Flattened Device Tree)
[    0.179013] task: ee0c8ec0 task.stack: ee0ca000
[    0.179018] PC is at 0x0
[    0.179029] LR is at omap4_cpu_die+0x58/0x98
[    0.179034] pc : [<00000000>]    lr : [<c01243dc>]    psr: 60000093
[    0.179034] sp : ee0cbfb8  ip : 00000000  fp : 00000000
[    0.179038] r10: 00000000  r9 : c0d50569  r8 : 00000000
[    0.179042] r7 : c0c76448  r6 : c0d0792c  r5 : 00000001  r4 : c0b08054
[    0.179046] r3 : 00000001  r2 : f0880000  r1 : 00000003  r0 : 00000001
[    0.179051] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[    0.179055] Control: 10c5387d  Table: 8000406a  DAC: 00000051
[    0.179059] Process swapper/1 (pid: 0, stack limit = 0xee0ca218)
[    0.179063] Stack: (0xee0cbfb8 to 0xee0cc000)
[    0.179068] bfa0:                                                       00000000 00000000
[    0.179075] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    0.179082] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 681b0041 cf3e4021
[    0.179092] [<c01243dc>] (omap4_cpu_die) from [<00000000>] (  (null))
[    0.179098] Code: bad PC value
[    0.179115] ---[ end trace e14406c260ce69db ]---
[    0.179121] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.179135] CPU0: stopping
[    0.179141] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
[    0.339715] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D         4.10.0-rc8-next-20170215+ #120
[    0.348927] Hardware name: Generic OMAP5 (Flattened Device Tree)
[    0.355112] [<c0110228>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
[    0.363083] [<c010c224>] (show_stack) from [<c04ca860>] (dump_stack+0xac/0xe0)
[    0.370513] [<c04ca860>] (dump_stack) from [<c010e72c>] (handle_IPI+0x358/0x3f8)
[    0.378120] [<c010e72c>] (handle_IPI) from [<c01015a4>] (gic_handle_irq+0x9c/0xb8)
[    0.385909] [<c01015a4>] (gic_handle_irq) from [<c083b270>] (__irq_svc+0x70/0x98)
[    0.393602] Exception stack(0xc0d01f38 to 0xc0d01f80)
[    0.398794] 1f20:                                                       c0108284 00000000
[    0.407205] 1f40: 00000000 00000000 c0d00000 c0d07994 c0d0792c c0c76448 c0d08560 c0d50569
[    0.415616] 1f60: 00000000 00000000 00000000 c0d01f88 c0108284 c0108288 60000013 ffffffff
[    0.424032] [<c083b270>] (__irq_svc) from [<c0108288>] (arch_cpu_idle+0x20/0x3c)
[    0.431643] [<c0108288>] (arch_cpu_idle) from [<c0190bc4>] (do_idle+0x164/0x218)
[    0.439251] [<c0190bc4>] (do_idle) from [<c0190ffc>] (cpu_startup_entry+0x18/0x1c)
[    0.447040] [<c0190ffc>] (cpu_startup_entry) from [<c0c00c40>] (start_kernel+0x35c/0x3d4)
[    0.455451] [<c0c00c40>] (start_kernel) from [<8000807c>] (0x8000807c)
Tony Lindgren Feb. 15, 2017, 7:12 p.m. UTC | #3
* Tony Lindgren <tony@atomide.com> [170215 10:40]:
> * Tony Lindgren <tony@atomide.com> [170214 11:39]:
> > * Tony Lindgren <tony@atomide.com> [170213 13:51]:
> > > Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
> > > resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
> > > on omap4 when doing kexec boot between two different kernel versions. The
> > > booted kernel ended up trying to use the old kernel start-up address unless
> > > cpu1 was reset before configuring the cpu1 start-up address.
> > > 
> > > It seems the reset part was not correct but probably working around some
> > > other issue. I have not been able to reproduce this issue any longer despite
> > > testing with backported patches back to v4.6 kernel. So it is possible this
> > > issue was caused by other work in progress kexec patches I had applied. Or
> > > it is possible some other fixes have made the issue go way.
> > > 
> > > The unconditional reset of cpu1 can cause issues booting some devices. For
> > > example, bootloader configured secure OS running on cpu1 will fail as the
> > > configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.
> > > 
> > > Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
> > > still need to reset cpu1 in some cases, we can add it back and do it
> > > conditionally.
> > 
> > Actually with this I'm now seeing cpu1 not come up after a suspend/resume
> > cycle on duovero:
> > 
> > [  118.257415] CPU1: shutdown
> > [  118.294616] Error taking CPU1 up: -2
> > [  118.299072] PM: noirq resume of devices complete after 3.723 msecs
> > [  118.303802] PM: early resume of devices complete after 3.723 msecs
> > 
> > So this issue needs to be investigated more.
> 
> And then today the omap4 suspend/resume issue is no longer reproducable..
> Go figure.
> 
> But then doing more testing I noticed that also omap5 needs the reset.
> Without it we get the following on omap5-uevm doing a kexec boot. So clearly
> the reset cannot be just removed at least for omap4 and omap5.

And also the same issue happens doing kexec on beagle-x15 naturally if
the cpu1 reset is removed.

Regards,

Tony

> 8< ---------------------
> [    0.156796] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> [    0.163396] Setting up static identity map for 0x80100000 - 0x80100070
> [    0.172246] smp: Bringing up secondary CPUs ...
> [    0.178970] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [    0.178974] pgd = c0004000
> [    0.178977] [00000000] *pgd=00000000
> [    0.178990] Internal error: Oops: 80000005 [#1] SMP ARM
> [    0.178995] Modules linked in:
> [    0.179005] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.0-rc8-next-20170215+ #120
> [    0.179008] Hardware name: Generic OMAP5 (Flattened Device Tree)
> [    0.179013] task: ee0c8ec0 task.stack: ee0ca000
> [    0.179018] PC is at 0x0
> [    0.179029] LR is at omap4_cpu_die+0x58/0x98
> [    0.179034] pc : [<00000000>]    lr : [<c01243dc>]    psr: 60000093
> [    0.179034] sp : ee0cbfb8  ip : 00000000  fp : 00000000
> [    0.179038] r10: 00000000  r9 : c0d50569  r8 : 00000000
> [    0.179042] r7 : c0c76448  r6 : c0d0792c  r5 : 00000001  r4 : c0b08054
> [    0.179046] r3 : 00000001  r2 : f0880000  r1 : 00000003  r0 : 00000001
> [    0.179051] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [    0.179055] Control: 10c5387d  Table: 8000406a  DAC: 00000051
> [    0.179059] Process swapper/1 (pid: 0, stack limit = 0xee0ca218)
> [    0.179063] Stack: (0xee0cbfb8 to 0xee0cc000)
> [    0.179068] bfa0:                                                       00000000 00000000
> [    0.179075] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [    0.179082] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 681b0041 cf3e4021
> [    0.179092] [<c01243dc>] (omap4_cpu_die) from [<00000000>] (  (null))
> [    0.179098] Code: bad PC value
> [    0.179115] ---[ end trace e14406c260ce69db ]---
> [    0.179121] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.179135] CPU0: stopping
> [    0.179141] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.339715] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D         4.10.0-rc8-next-20170215+ #120
> [    0.348927] Hardware name: Generic OMAP5 (Flattened Device Tree)
> [    0.355112] [<c0110228>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
> [    0.363083] [<c010c224>] (show_stack) from [<c04ca860>] (dump_stack+0xac/0xe0)
> [    0.370513] [<c04ca860>] (dump_stack) from [<c010e72c>] (handle_IPI+0x358/0x3f8)
> [    0.378120] [<c010e72c>] (handle_IPI) from [<c01015a4>] (gic_handle_irq+0x9c/0xb8)
> [    0.385909] [<c01015a4>] (gic_handle_irq) from [<c083b270>] (__irq_svc+0x70/0x98)
> [    0.393602] Exception stack(0xc0d01f38 to 0xc0d01f80)
> [    0.398794] 1f20:                                                       c0108284 00000000
> [    0.407205] 1f40: 00000000 00000000 c0d00000 c0d07994 c0d0792c c0c76448 c0d08560 c0d50569
> [    0.415616] 1f60: 00000000 00000000 00000000 c0d01f88 c0108284 c0108288 60000013 ffffffff
> [    0.424032] [<c083b270>] (__irq_svc) from [<c0108288>] (arch_cpu_idle+0x20/0x3c)
> [    0.431643] [<c0108288>] (arch_cpu_idle) from [<c0190bc4>] (do_idle+0x164/0x218)
> [    0.439251] [<c0190bc4>] (do_idle) from [<c0190ffc>] (cpu_startup_entry+0x18/0x1c)
> [    0.447040] [<c0190ffc>] (cpu_startup_entry) from [<c0c00c40>] (start_kernel+0x35c/0x3d4)
> [    0.455451] [<c0c00c40>] (start_kernel) from [<8000807c>] (0x8000807c)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Davis Feb. 15, 2017, 10:13 p.m. UTC | #4
On 02/15/2017 01:12 PM, Tony Lindgren wrote:
> * Tony Lindgren <tony@atomide.com> [170215 10:40]:
>> * Tony Lindgren <tony@atomide.com> [170214 11:39]:
>>> * Tony Lindgren <tony@atomide.com> [170213 13:51]:
>>>> Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
>>>> resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
>>>> on omap4 when doing kexec boot between two different kernel versions. The
>>>> booted kernel ended up trying to use the old kernel start-up address unless
>>>> cpu1 was reset before configuring the cpu1 start-up address.
>>>>
>>>> It seems the reset part was not correct but probably working around some
>>>> other issue. I have not been able to reproduce this issue any longer despite
>>>> testing with backported patches back to v4.6 kernel. So it is possible this
>>>> issue was caused by other work in progress kexec patches I had applied. Or
>>>> it is possible some other fixes have made the issue go way.
>>>>
>>>> The unconditional reset of cpu1 can cause issues booting some devices. For
>>>> example, bootloader configured secure OS running on cpu1 will fail as the
>>>> configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.
>>>>
>>>> Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
>>>> still need to reset cpu1 in some cases, we can add it back and do it
>>>> conditionally.
>>>
>>> Actually with this I'm now seeing cpu1 not come up after a suspend/resume
>>> cycle on duovero:
>>>
>>> [  118.257415] CPU1: shutdown
>>> [  118.294616] Error taking CPU1 up: -2
>>> [  118.299072] PM: noirq resume of devices complete after 3.723 msecs
>>> [  118.303802] PM: early resume of devices complete after 3.723 msecs
>>>
>>> So this issue needs to be investigated more.
>>
>> And then today the omap4 suspend/resume issue is no longer reproducable..
>> Go figure.
>>
>> But then doing more testing I noticed that also omap5 needs the reset.
>> Without it we get the following on omap5-uevm doing a kexec boot. So clearly
>> the reset cannot be just removed at least for omap4 and omap5.
> 
> And also the same issue happens doing kexec on beagle-x15 naturally if
> the cpu1 reset is removed.
> 

When a core actually powers up it idles in ROM code waiting for
OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
powered off, we just let it spin in omap4_cpu_die() or
omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
it were still trapped in ROM after a reset.

The issue with this fake startup idle loop is that, unlike the ROM based
startup idle loop, these do *not* jump to the address we stored in
OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
jump to the kernel startup function.

So when we tell this core to boot, and it is not in the real ROM startup
loop, it breaks stuff as it jumps to the old kernel's
secondary_startup() even though we gave it the correct address in
OMAP_AUX_CORE_BOOT_1.

Reseting the core to put it back in the real ROM idle loop is wrong, the
two idle loop functions above should be fixed to respect the address in
OMAP_AUX_CORE_BOOT_1 and not to make assumptions, this should take care
of the kexec failure in a sane way.

Andrew

> Regards,
> 
> Tony
> 
>> 8< ---------------------
>> [    0.156796] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
>> [    0.163396] Setting up static identity map for 0x80100000 - 0x80100070
>> [    0.172246] smp: Bringing up secondary CPUs ...
>> [    0.178970] Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> [    0.178974] pgd = c0004000
>> [    0.178977] [00000000] *pgd=00000000
>> [    0.178990] Internal error: Oops: 80000005 [#1] SMP ARM
>> [    0.178995] Modules linked in:
>> [    0.179005] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.0-rc8-next-20170215+ #120
>> [    0.179008] Hardware name: Generic OMAP5 (Flattened Device Tree)
>> [    0.179013] task: ee0c8ec0 task.stack: ee0ca000
>> [    0.179018] PC is at 0x0
>> [    0.179029] LR is at omap4_cpu_die+0x58/0x98
>> [    0.179034] pc : [<00000000>]    lr : [<c01243dc>]    psr: 60000093
>> [    0.179034] sp : ee0cbfb8  ip : 00000000  fp : 00000000
>> [    0.179038] r10: 00000000  r9 : c0d50569  r8 : 00000000
>> [    0.179042] r7 : c0c76448  r6 : c0d0792c  r5 : 00000001  r4 : c0b08054
>> [    0.179046] r3 : 00000001  r2 : f0880000  r1 : 00000003  r0 : 00000001
>> [    0.179051] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
>> [    0.179055] Control: 10c5387d  Table: 8000406a  DAC: 00000051
>> [    0.179059] Process swapper/1 (pid: 0, stack limit = 0xee0ca218)
>> [    0.179063] Stack: (0xee0cbfb8 to 0xee0cc000)
>> [    0.179068] bfa0:                                                       00000000 00000000
>> [    0.179075] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> [    0.179082] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 681b0041 cf3e4021
>> [    0.179092] [<c01243dc>] (omap4_cpu_die) from [<00000000>] (  (null))
>> [    0.179098] Code: bad PC value
>> [    0.179115] ---[ end trace e14406c260ce69db ]---
>> [    0.179121] Kernel panic - not syncing: Attempted to kill the idle task!
>> [    0.179135] CPU0: stopping
>> [    0.179141] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
>> [    0.339715] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D         4.10.0-rc8-next-20170215+ #120
>> [    0.348927] Hardware name: Generic OMAP5 (Flattened Device Tree)
>> [    0.355112] [<c0110228>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
>> [    0.363083] [<c010c224>] (show_stack) from [<c04ca860>] (dump_stack+0xac/0xe0)
>> [    0.370513] [<c04ca860>] (dump_stack) from [<c010e72c>] (handle_IPI+0x358/0x3f8)
>> [    0.378120] [<c010e72c>] (handle_IPI) from [<c01015a4>] (gic_handle_irq+0x9c/0xb8)
>> [    0.385909] [<c01015a4>] (gic_handle_irq) from [<c083b270>] (__irq_svc+0x70/0x98)
>> [    0.393602] Exception stack(0xc0d01f38 to 0xc0d01f80)
>> [    0.398794] 1f20:                                                       c0108284 00000000
>> [    0.407205] 1f40: 00000000 00000000 c0d00000 c0d07994 c0d0792c c0c76448 c0d08560 c0d50569
>> [    0.415616] 1f60: 00000000 00000000 00000000 c0d01f88 c0108284 c0108288 60000013 ffffffff
>> [    0.424032] [<c083b270>] (__irq_svc) from [<c0108288>] (arch_cpu_idle+0x20/0x3c)
>> [    0.431643] [<c0108288>] (arch_cpu_idle) from [<c0190bc4>] (do_idle+0x164/0x218)
>> [    0.439251] [<c0190bc4>] (do_idle) from [<c0190ffc>] (cpu_startup_entry+0x18/0x1c)
>> [    0.447040] [<c0190ffc>] (cpu_startup_entry) from [<c0c00c40>] (start_kernel+0x35c/0x3d4)
>> [    0.455451] [<c0c00c40>] (start_kernel) from [<8000807c>] (0x8000807c)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Lindgren Feb. 15, 2017, 10:27 p.m. UTC | #5
* Andrew F. Davis <afd@ti.com> [170215 14:14]:
> On 02/15/2017 01:12 PM, Tony Lindgren wrote:
> > * Tony Lindgren <tony@atomide.com> [170215 10:40]:
> >> * Tony Lindgren <tony@atomide.com> [170214 11:39]:
> >>> * Tony Lindgren <tony@atomide.com> [170213 13:51]:
> >>>> Commit 3251885285e1 ("ARM: OMAP4+: Reset CPU1 properly for kexec") started
> >>>> resetting cpu1 because of a kexec boot issue I was seeing earlier in 2016
> >>>> on omap4 when doing kexec boot between two different kernel versions. The
> >>>> booted kernel ended up trying to use the old kernel start-up address unless
> >>>> cpu1 was reset before configuring the cpu1 start-up address.
> >>>>
> >>>> It seems the reset part was not correct but probably working around some
> >>>> other issue. I have not been able to reproduce this issue any longer despite
> >>>> testing with backported patches back to v4.6 kernel. So it is possible this
> >>>> issue was caused by other work in progress kexec patches I had applied. Or
> >>>> it is possible some other fixes have made the issue go way.
> >>>>
> >>>> The unconditional reset of cpu1 can cause issues booting some devices. For
> >>>> example, bootloader configured secure OS running on cpu1 will fail as the
> >>>> configuration is not preserved as reported by Andrew F. Davis <afd@ti.com>.
> >>>>
> >>>> Let's fix the issue by reverting the cpu1 reset parts. If it turns out we
> >>>> still need to reset cpu1 in some cases, we can add it back and do it
> >>>> conditionally.
> >>>
> >>> Actually with this I'm now seeing cpu1 not come up after a suspend/resume
> >>> cycle on duovero:
> >>>
> >>> [  118.257415] CPU1: shutdown
> >>> [  118.294616] Error taking CPU1 up: -2
> >>> [  118.299072] PM: noirq resume of devices complete after 3.723 msecs
> >>> [  118.303802] PM: early resume of devices complete after 3.723 msecs
> >>>
> >>> So this issue needs to be investigated more.
> >>
> >> And then today the omap4 suspend/resume issue is no longer reproducable..
> >> Go figure.
> >>
> >> But then doing more testing I noticed that also omap5 needs the reset.
> >> Without it we get the following on omap5-uevm doing a kexec boot. So clearly
> >> the reset cannot be just removed at least for omap4 and omap5.
> > 
> > And also the same issue happens doing kexec on beagle-x15 naturally if
> > the cpu1 reset is removed.
> > 
> 
> When a core actually powers up it idles in ROM code waiting for
> OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
> powered off, we just let it spin in omap4_cpu_die() or
> omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
> it were still trapped in ROM after a reset.
> 
> The issue with this fake startup idle loop is that, unlike the ROM based
> startup idle loop, these do *not* jump to the address we stored in
> OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
> jump to the kernel startup function.
> 
> So when we tell this core to boot, and it is not in the real ROM startup
> loop, it breaks stuff as it jumps to the old kernel's
> secondary_startup() even though we gave it the correct address in
> OMAP_AUX_CORE_BOOT_1.

Yes this is probably what's going on here. Note that the error I pasted
was booting the same kernel where that address should be correct though.
So there might be something else to it also.

> Reseting the core to put it back in the real ROM idle loop is wrong, the
> two idle loop functions above should be fixed to respect the address in
> OMAP_AUX_CORE_BOOT_1 and not to make assumptions, this should take care
> of the kexec failure in a sane way.

OK care to try to patch it as now you also have a reproducable test
case for kexec too?

Regards,

Tony
Tony Lindgren Feb. 16, 2017, 4:10 p.m. UTC | #6
* Tony Lindgren <tony@atomide.com> [170215 14:28]:
> * Andrew F. Davis <afd@ti.com> [170215 14:14]:
> > On 02/15/2017 01:12 PM, Tony Lindgren wrote:
> > > And also the same issue happens doing kexec on beagle-x15 naturally if
> > > the cpu1 reset is removed.
> > > 
> > 
> > When a core actually powers up it idles in ROM code waiting for
> > OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
> > powered off, we just let it spin in omap4_cpu_die() or
> > omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
> > it were still trapped in ROM after a reset.

OK so I debugged this a bit more. We have CPU1 in omap_do_wfi()
as we don't currently have omap5_secondary_startup() or any deeper
idle mode support beyond retention for omap5 or dra7 in the mainline
kernel.

> > The issue with this fake startup idle loop is that, unlike the ROM based
> > startup idle loop, these do *not* jump to the address we stored in
> > OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
> > jump to the kernel startup function.

This does not seem to be the case here.

> > So when we tell this core to boot, and it is not in the real ROM startup
> > loop, it breaks stuff as it jumps to the old kernel's
> > secondary_startup() even though we gave it the correct address in
> > OMAP_AUX_CORE_BOOT_1.

And this is not happening. I think this is what I was seeing earlier,
but it's not the omap5/dra7 issue.

What we have is cpu1 returning from previous kernel's omap_do_wfi()
in the kexec booted kernel's code and that's when things go wrong.

So if cpu1 was configured for idle for any reason, it will never gets
to omap5_secondary_startup without the reset currently.

The reason kexec and suspend/resume mostly works for omap4 without
cpu1 reset is that we usually enter off mode for cpu1 and the context
is lost and then we properly go through omap4_secondary_startup. Or
that's my theory so far for the occasional flakeyness I've been seeing :)

Any ideas what we should try to check to see if cpu1 is in idle
mode so we can do the reset if needed?

Regards,

Tony
Tony Lindgren Feb. 16, 2017, 4:21 p.m. UTC | #7
* Tony Lindgren <tony@atomide.com> [170216 08:11]:
> * Tony Lindgren <tony@atomide.com> [170215 14:28]:
> > * Andrew F. Davis <afd@ti.com> [170215 14:14]:
> > > On 02/15/2017 01:12 PM, Tony Lindgren wrote:
> > > > And also the same issue happens doing kexec on beagle-x15 naturally if
> > > > the cpu1 reset is removed.
> > > > 
> > > 
> > > When a core actually powers up it idles in ROM code waiting for
> > > OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
> > > powered off, we just let it spin in omap4_cpu_die() or
> > > omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
> > > it were still trapped in ROM after a reset.
> 
> OK so I debugged this a bit more. We have CPU1 in omap_do_wfi()
> as we don't currently have omap5_secondary_startup() or any deeper
> idle mode support beyond retention for omap5 or dra7 in the mainline
> kernel.
> 
> > > The issue with this fake startup idle loop is that, unlike the ROM based
> > > startup idle loop, these do *not* jump to the address we stored in
> > > OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
> > > jump to the kernel startup function.
> 
> This does not seem to be the case here.
> 
> > > So when we tell this core to boot, and it is not in the real ROM startup
> > > loop, it breaks stuff as it jumps to the old kernel's
> > > secondary_startup() even though we gave it the correct address in
> > > OMAP_AUX_CORE_BOOT_1.
> 
> And this is not happening. I think this is what I was seeing earlier,
> but it's not the omap5/dra7 issue.
> 
> What we have is cpu1 returning from previous kernel's omap_do_wfi()
> in the kexec booted kernel's code and that's when things go wrong.
> 
> So if cpu1 was configured for idle for any reason, it will never gets
> to omap5_secondary_startup without the reset currently.
> 
> The reason kexec and suspend/resume mostly works for omap4 without
> cpu1 reset is that we usually enter off mode for cpu1 and the context
> is lost and then we properly go through omap4_secondary_startup. Or
> that's my theory so far for the occasional flakeyness I've been seeing :)
> 
> Any ideas what we should try to check to see if cpu1 is in idle
> mode so we can do the reset if needed?

Maybe we should do the reset if OMAP5_CPU1_WAKEUP_NS_PA_ADDR_OFFSET
is not 0?

Regards,

Tony
Andrew Davis Feb. 16, 2017, 4:29 p.m. UTC | #8
On 02/16/2017 10:10 AM, Tony Lindgren wrote:
> * Tony Lindgren <tony@atomide.com> [170215 14:28]:
>> * Andrew F. Davis <afd@ti.com> [170215 14:14]:
>>> On 02/15/2017 01:12 PM, Tony Lindgren wrote:
>>>> And also the same issue happens doing kexec on beagle-x15 naturally if
>>>> the cpu1 reset is removed.
>>>>
>>>
>>> When a core actually powers up it idles in ROM code waiting for
>>> OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
>>> powered off, we just let it spin in omap4_cpu_die() or
>>> omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
>>> it were still trapped in ROM after a reset.
> 
> OK so I debugged this a bit more. We have CPU1 in omap_do_wfi()
> as we don't currently have omap5_secondary_startup() or any deeper
> idle mode support beyond retention for omap5 or dra7 in the mainline
> kernel.
> 
>>> The issue with this fake startup idle loop is that, unlike the ROM based
>>> startup idle loop, these do *not* jump to the address we stored in
>>> OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
>>> jump to the kernel startup function.
> 
> This does not seem to be the case here.
> 

Well this is what I am seeing every time, this code only works when it
is the same kernel we kexec, any changed addresses here will not work.

>>> So when we tell this core to boot, and it is not in the real ROM startup
>>> loop, it breaks stuff as it jumps to the old kernel's
>>> secondary_startup() even though we gave it the correct address in
>>> OMAP_AUX_CORE_BOOT_1.
> 
> And this is not happening. I think this is what I was seeing earlier,
> but it's not the omap5/dra7 issue.
> 
> What we have is cpu1 returning from previous kernel's omap_do_wfi()
> in the kexec booted kernel's code and that's when things go wrong.
> 

We are the ones sending it to omap_do_wfi(), in omap4_cpu_die() it gets
idled in a loop, it shouldn't be idled after it is shut off, it should
get parked, we should do this like we do in omap5_secondary_startup().

> So if cpu1 was configured for idle for any reason, it will never gets
> to omap5_secondary_startup without the reset currently.
> 
> The reason kexec and suspend/resume mostly works for omap4 without
> cpu1 reset is that we usually enter off mode for cpu1 and the context
> is lost and then we properly go through omap4_secondary_startup. Or
> that's my theory so far for the occasional flakeyness I've been seeing :)
> 
> Any ideas what we should try to check to see if cpu1 is in idle
> mode so we can do the reset if needed?
> 

You can never reset the core, resetting the core is not allowed on HS
devices and so it really doesn't matter what the core is doing. In no
case is reseting the core a valid work-around for not correctly parking
it. We need to fix the omap4_cpu_die() to not let the core go idle if
the return from idle path is the problem.

Andrew

> Regards,
> 
> Tony
>
Tony Lindgren Feb. 16, 2017, 4:54 p.m. UTC | #9
* Andrew F. Davis <afd@ti.com> [170216 08:30]:
> On 02/16/2017 10:10 AM, Tony Lindgren wrote:
> > * Tony Lindgren <tony@atomide.com> [170215 14:28]:
> >> * Andrew F. Davis <afd@ti.com> [170215 14:14]:
> >>> On 02/15/2017 01:12 PM, Tony Lindgren wrote:
> >>>> And also the same issue happens doing kexec on beagle-x15 naturally if
> >>>> the cpu1 reset is removed.
> >>>>
> >>>
> >>> When a core actually powers up it idles in ROM code waiting for
> >>> OMAP_AUX_CORE_BOOT_0 to be set. When we shutdown a core it is not really
> >>> powered off, we just let it spin in omap4_cpu_die() or
> >>> omap4_secondary_startup() waiting on OMAP_AUX_CORE_BOOT_0, just like if
> >>> it were still trapped in ROM after a reset.
> > 
> > OK so I debugged this a bit more. We have CPU1 in omap_do_wfi()
> > as we don't currently have omap5_secondary_startup() or any deeper
> > idle mode support beyond retention for omap5 or dra7 in the mainline
> > kernel.
> > 
> >>> The issue with this fake startup idle loop is that, unlike the ROM based
> >>> startup idle loop, these do *not* jump to the address we stored in
> >>> OMAP_AUX_CORE_BOOT_1, they just make the assumption that they can safely
> >>> jump to the kernel startup function.
> > 
> > This does not seem to be the case here.
> > 
> 
> Well this is what I am seeing every time, this code only works when it
> is the same kernel we kexec, any changed addresses here will not work.

Hmm let's talk the mainline kernel here. Currently things do work in
the mainline kernel because of the cpu1 reset. And without cpu1 reset
things will currently go wrong in the mainline kernel both for kexec
and suspend/resume.

> >>> So when we tell this core to boot, and it is not in the real ROM startup
> >>> loop, it breaks stuff as it jumps to the old kernel's
> >>> secondary_startup() even though we gave it the correct address in
> >>> OMAP_AUX_CORE_BOOT_1.
> > 
> > And this is not happening. I think this is what I was seeing earlier,
> > but it's not the omap5/dra7 issue.
> > 
> > What we have is cpu1 returning from previous kernel's omap_do_wfi()
> > in the kexec booted kernel's code and that's when things go wrong.
> > 
> 
> We are the ones sending it to omap_do_wfi(), in omap4_cpu_die() it gets
> idled in a loop, it shouldn't be idled after it is shut off, it should
> get parked, we should do this like we do in omap5_secondary_startup().

Yup agreed. We need to figure out if it's just normal cpuidle hot-unplug
event vs shut down and park for kexec. Probably cpu_kill() is the place
to park it, need to check.

> > So if cpu1 was configured for idle for any reason, it will never gets
> > to omap5_secondary_startup without the reset currently.
> > 
> > The reason kexec and suspend/resume mostly works for omap4 without
> > cpu1 reset is that we usually enter off mode for cpu1 and the context
> > is lost and then we properly go through omap4_secondary_startup. Or
> > that's my theory so far for the occasional flakeyness I've been seeing :)
> > 
> > Any ideas what we should try to check to see if cpu1 is in idle
> > mode so we can do the reset if needed?
> > 
> 
> You can never reset the core, resetting the core is not allowed on HS
> devices and so it really doesn't matter what the core is doing. In no
> case is reseting the core a valid work-around for not correctly parking
> it. We need to fix the omap4_cpu_die() to not let the core go idle if
> the return from idle path is the problem.

Yeah well from Linux point of view, what we're interested in is that
cpu1 comes up reliably in all cases no matter what it takes. I agree
doing a reset on it should be only done if nothing else helps. And I
can see some HS implementations not allowing cpu1 reset. And I can see
some product specific bootloaders idle cpu1 and that's where things
break again.

For your use case, probably all we need is runtime checks for HS in
addition to parking cpu1 for kexec. If that's not enough, then maybe
a device specific DT property for never-reset-no-matter-what.

Regards,

Tony
diff mbox

Patch

diff --git a/arch/arm/mach-omap2/omap-smp.c b/arch/arm/mach-omap2/omap-smp.c
--- a/arch/arm/mach-omap2/omap-smp.c
+++ b/arch/arm/mach-omap2/omap-smp.c
@@ -300,16 +300,6 @@  static void __init omap4_smp_prepare_cpus(unsigned int max_cpus)
 		scu_enable(cfg.scu_base);
 
 	/*
-	 * Reset CPU1 before configuring, otherwise kexec will
-	 * end up trying to use old kernel startup address.
-	 */
-	if (cfg.cpu1_rstctrl_va) {
-		writel_relaxed(1, cfg.cpu1_rstctrl_va);
-		readl_relaxed(cfg.cpu1_rstctrl_va);
-		writel_relaxed(0, cfg.cpu1_rstctrl_va);
-	}
-
-	/*
 	 * Write the address of secondary startup routine into the
 	 * AuxCoreBoot1 where ROM code will jump and start executing
 	 * on secondary core once out of WFE