Message ID | 55D25C64.3090107@ti.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Aug 17, 2015 at 06:12:52PM -0400, Murali Karicheri wrote:
> Unfortunately, this patch causes boot to stop very early just after
> local_abt_enable() is called in early_trap_init(). Before and After applying
> the patch, here is what the boot log looks like. Do you see any issue with
> the patch diff shown below? Patch is applied on top of v4.2-rc7. I have some
> additional base port patches applied to boot kernel on my EVM based on a new
> SoC.

Try moving the call to local_abt_enable() below forward to the end of
devicemaps_init(). I suspect this is too early for the abort handlers
to reliably run.

> diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
> index d358226..381c4e4 100644
> --- a/arch/arm/kernel/traps.c
> +++ b/arch/arm/kernel/traps.c
> @@ -871,6 +871,11 @@ void __init early_trap_init(void *vectors_base)
>
>  	flush_icache_range(vectors, vectors + PAGE_SIZE * 2);
>  	modify_domain(DOMAIN_USER, DOMAIN_CLIENT);
> +
> +	/* Enable imprecise aborts */
> +	local_abt_enable();
> +
Murali,

On 8/17/15 3:12 PM, Murali Karicheri wrote:
> On 08/14/2015 05:56 PM, Russell King - ARM Linux wrote:
>> On Fri, Aug 14, 2015 at 05:53:00PM -0400, Murali Karicheri wrote:
>>> We have spent some time already to debug the root cause. Do you have
>>> idea on how this was hunted down on OMAP that we can learn from? The
>>> bad address is NULL and it seems to happen very rarely and is not
>>> easily reproducible. Don't want to put this workaround, but we
>>> couldn't track it down either. So any help to debug this will be
>>> appreciated.
>>
>> If you try applying Lucas' patch, you should receive the abort earlier
>> in the kernel boot up, which may help narrow down what is provoking it.
>>
>
> Unfortunately, this patch causes boot to stop very early just after
> local_abt_enable() is called in early_trap_init(). Before and After
> applying the patch, here is what the boot log looks like. Do you see any
> issue with the patch diff shown below? Patch is applied on top of
> v4.2-rc7. I have some additional base port patches applied to boot
> kernel on my EVM based on a new SoC.
>

From the logs this seems to be mostly clock related issue for some
peripheral. If the bootloader clock enable all hack still exists,
maybe you can try that out.

Another way to debug this is to start disabling peripheral drivers
from the kernel 1 by 1 and see if the issue goes away.

Regards,
Santosh
On Mon, Aug 17, 2015 at 08:09:17PM -0700, santosh.shilimkar@oracle.com wrote:
> From the logs this seems to be mostly clock related issue for some
> peripheral. If the bootloader clock enable all hack still exists,
> maybe you can try that out.
>
> Another way to debug this is to start disabling peripheral drivers
> from the kernel 1 by 1 and see if the issue goes away.

Highly unlikely to make any difference. As the failure happens so early
with the patch applied, the kernel hasn't had much of a chance to touch
the hardware - about the only things are the decompressor and the kernel
touching the early console. As they seem to be working, it suggests
that's not the cause.

It seems to be pointing towards something in the boot loader...

Normally, uboot will hook itself into the vectors to report errors, but
I wonder whether uboot enables asynchronous aborts while it's running.
Don't forget to make sure that the aborts are disabled again prior to
calling the kernel.
On Tuesday, 18.08.2015 at 09:13 +0100, Russell King - ARM Linux wrote:
> On Mon, Aug 17, 2015 at 08:09:17PM -0700, santosh.shilimkar@oracle.com wrote:
> > From the logs this seems to be mostly clock related issue for some
> > peripheral. If the bootloader clock enable all hack still exists,
> > maybe you can try that out.
> >
> > Another way to debug this is to start disabling peripheral drivers
> > from the kernel 1 by 1 and see if the issue goes away.
>
> Highly unlikely to make any difference. As the failure happens so early
> with the patch applied, the kernel hasn't had much of a chance to touch
> the hardware - about the only things are the decompressor and the kernel
> touching the early console. As they seem to be working, it suggests
> that's not the cause.
>
> It seems to be pointing towards something in the boot loader...
>
> Normally, uboot will hook itself into the vectors to report errors, but
> I wonder whether uboot enables asynchronous aborts while it's running.
> Don't forget to make sure that the aborts are disabled again prior to
> calling the kernel.
>

At least one of the Marvell platforms has the same issue with the
bootloader (I think it is some downstream U-Boot) leaving an imprecise
abort hanging around as a nice present for Linux to crash on.

If it turns out to be the same issue the only kernel level workaround
would be to ignore exactly 1 abort after bootup. Then we still need a
solution for the platform and the PCIe driver abort handler both hooking
into the same abort vector, which won't work currently.

Regards,
Lucas
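The "ignore exactly 1 abort after bootup" workaround described above can be sketched in plain C as a small budgeted filter. This is only an illustration of the idea, not kernel code and not part of any patch in this thread: `struct abort_filter` and `abort_filter_should_ignore()` are hypothetical names, and a real implementation would have to be wired into the ARM fault handling path.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only.
 * Tracks how many stale aborts (left pending by earlier boot stages)
 * may be swallowed before aborts are treated as real errors again.
 */
struct abort_filter {
	unsigned int ignored;	/* stale aborts swallowed so far */
	unsigned int budget;	/* maximum number to swallow (1 in the thread) */
};

/*
 * Returns true if this abort should be treated as a leftover from the
 * boot loader and ignored, false if it must be handled as a real abort.
 */
static bool abort_filter_should_ignore(struct abort_filter *f)
{
	if (f->ignored < f->budget) {
		f->ignored++;
		return true;
	}
	return false;
}
```

With `budget` set to 1, only the first abort seen after unmasking is swallowed; every later one is reported as usual. Note that this does nothing about the second problem raised above, namely the platform code and the PCIe driver both wanting to hook the same abort vector.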
On Tue, 18 Aug 2015 09:13:34 +0100
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Mon, Aug 17, 2015 at 08:09:17PM -0700, santosh.shilimkar@oracle.com wrote:
> > From the logs this seems to be mostly clock related issue for some
> > peripheral. If the bootloader clock enable all hack still exists,
> > maybe you can try that out.
> >
> > Another way to debug this is to start disabling peripheral drivers
> > from the kernel 1 by 1 and see if the issue goes away.
>
> Highly unlikely to make any difference. As the failure happens so early
> with the patch applied, the kernel hasn't had much of a chance to touch
> the hardware - about the only things are the decompressor and the kernel
> touching the early console. As they seem to be working, it suggests
> that's not the cause.
>
> It seems to be pointing towards something in the boot loader...
>
> Normally, uboot will hook itself into the vectors to report errors, but
> I wonder whether uboot enables asynchronous aborts while it's running.
> Don't forget to make sure that the aborts are disabled again prior to
> calling the kernel.
>

Another possible cause: trustzone software.

we root caused such kind of asynchronous external abort on Marvell Berlin SoCs
to a trustzone bug. I'm not sure whether keystone linux is running at normal
world or not.
Hi Murali,

On Tue, Aug 18, 2015 at 10:28:20AM +0200, Lucas Stach wrote:
> On Tuesday, 18.08.2015 at 09:13 +0100, Russell King - ARM Linux wrote:
> > It seems to be pointing towards something in the boot loader...
> >
> > Normally, uboot will hook itself into the vectors to report errors, but
> > I wonder whether uboot enables asynchronous aborts while it's running.
> > Don't forget to make sure that the aborts are disabled again prior to
> > calling the kernel.
> >
> At least one of the Marvell platforms has the same issue with the
> bootloader (I think it is some downstream U-Boot) leaving an imprecise
> abort hanging around as a nice present for Linux to crash on.

If you have a JTAG, maybe you can manually set CPSR.A bit (equivalent
of Lucas's patch) at bootloader/kernel entry and conclude who is the
culprit or maybe even localize it better.

This method did help in rootcausing issue in one of the SoC that showed
the same behaviour.

Regards
Afzal
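For anyone trying the JTAG approach, the following plain-C sketch shows which CPSR bit is involved. It assumes the ARMv7-A CPSR layout, where bit 8 (A) masks asynchronous aborts when set and `cpsie a` clears it; "setting" the bit in the message above is therefore read here as changing it so that aborts become enabled. The helper names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPSR.A: asynchronous abort mask bit (bit 8); 1 = masked, 0 = enabled */
#define CPSR_A_BIT	(UINT32_C(1) << 8)

/* Hypothetical helper: true if asynchronous aborts are masked in this CPSR */
static bool async_aborts_masked(uint32_t cpsr)
{
	return (cpsr & CPSR_A_BIT) != 0;
}

/*
 * Hypothetical helper: CPSR value to write back through the debugger so
 * that asynchronous aborts are enabled (same effect as "cpsie a").
 */
static uint32_t cpsr_unmask_async_aborts(uint32_t cpsr)
{
	return cpsr & ~CPSR_A_BIT;
}
```

For example, a CPSR of 0x1d3 (SVC mode with A, I and F all masked) would become 0xd3 after unmasking. If a pending abort fires immediately after the write at a given entry point, the abort was raised before that point.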
On 08/18/2015 04:28 AM, Jisheng Zhang wrote:
> On Tue, 18 Aug 2015 09:13:34 +0100
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
>> On Mon, Aug 17, 2015 at 08:09:17PM -0700, santosh.shilimkar@oracle.com wrote:
>>> From the logs this seems to be mostly clock related issue for some
>>> peripheral. If the bootloader clock enable all hack still exists,
>>> maybe you can try that out.
>>>
>>> Another way to debug this is to start disabling peripheral drivers
>>> from the kernel 1 by 1 and see if the issue goes away.
>>
>> Highly unlikely to make any difference. As the failure happens so early
>> with the patch applied, the kernel hasn't had much of a chance to touch
>> the hardware - about the only things are the decompressor and the kernel
>> touching the early console. As they seem to be working, it suggests
>> that's not the cause.
>>
>> It seems to be pointing towards something in the boot loader...
>>
>> Normally, uboot will hook itself into the vectors to report errors, but
>> I wonder whether uboot enables asynchronous aborts while it's running.
>> Don't forget to make sure that the aborts are disabled again prior to
>> calling the kernel.
>>
>
> Another possible cause: trustzone software.
>
> we root caused such kind of asynchronous external abort on Marvell Berlin SoCs
> to a trustzone bug. I'm not sure whether keystone linux is running at normal
> world or not.

Yes, in normal world (Non secure supervisor)

>
Russell,

On 08/18/2015 04:13 AM, Russell King - ARM Linux wrote:
> On Mon, Aug 17, 2015 at 08:09:17PM -0700, santosh.shilimkar@oracle.com wrote:
>> From the logs this seems to be mostly clock related issue for some
>> peripheral. If the bootloader clock enable all hack still exists,
>> maybe you can try that out.
>>
>> Another way to debug this is to start disabling peripheral drivers
>> from the kernel 1 by 1 and see if the issue goes away.
>
> Highly unlikely to make any difference. As the failure happens so early
> with the patch applied, the kernel hasn't had much of a chance to touch
> the hardware - about the only things are the decompressor and the kernel
> touching the early console. As they seem to be working, it suggests
> that's not the cause.
>
> It seems to be pointing towards something in the boot loader...
>
> Normally, uboot will hook itself into the vectors to report errors, but
> I wonder whether uboot enables asynchronous aborts while it's running.
> Don't forget to make sure that the aborts are disabled again prior to
> calling the kernel.
>

Thanks for your input.

The patch works now once I move the local_abt_enable() to later, just
before calling reserve_crashkernel() in setup_arch(). The abort handler
gets called right after enabling it, which means the abort has happened
even before reaching here. I have added the abort handler to u-boot code
and I get the same abort, which means the root cause is u-boot or the ROM
boot loader. I would try to debug if root cause is u-boot. If it is ROM
boot loader, I will have to add a work around in u-boot or Linux. Is
there a preference of one over the other? The exception handling in
u-boot is premature and will require more work to add a work around. Is
there still a possibility of adding the work around in Linux?
diff --git a/arch/arm/include/asm/irqflags.h b/arch/arm/include/asm/irqflags.h
index 4390814..ac1e7e9 100644
--- a/arch/arm/include/asm/irqflags.h
+++ b/arch/arm/include/asm/irqflags.h
@@ -54,6 +54,14 @@ static inline void arch_local_irq_disable(void)

 #define local_fiq_enable() __asm__("cpsie f @ __stf" : : : "memory", "cc")
 #define local_fiq_disable() __asm__("cpsid f @ __clf" : : : "memory", "cc")
+
+#ifndef CONFIG_CPU_V7M
+#define local_abt_enable() __asm__("cpsie a @ __sta" : : : "memory", "cc")
+#define local_abt_disable() __asm__("cpsid a @ __cla" : : : "memory", "cc")
+#else
+#define local_abt_enable() do { } while (0)
+#define local_abt_disable() do { } while (0)
+#endif
 #else

 /*
@@ -136,6 +144,8 @@ static inline void arch_local_irq_disable(void)
 	: "memory", "cc"); \
 	})

+#define local_abt_enable() do { } while (0)
+#define local_abt_disable() do { } while (0)
 #endif

 /*
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 3d6b782..27c944b 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -358,7 +358,7 @@ asmlinkage void secondary_start_kernel(void)
 	cpu_init();
 	pr_debug("CPU%u: Booted secondary processor\n", cpu);
 	preempt_disable();
 	trace_hardirqs_off();
@@ -385,6 +385,7 @@ asmlinkage void secondary_start_kernel(void)

 	local_irq_enable();
 	local_fiq_enable();
+	local_abt_enable();

 	/*
 	 * OK, it's off to the idle thread for us
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index d358226..381c4e4 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -871,6 +871,11 @@ void __init early_trap_init(void *vectors_base)

 	flush_icache_range(vectors, vectors + PAGE_SIZE * 2);
 	modify_domain(DOMAIN_USER, DOMAIN_CLIENT);
+
+	/* Enable imprecise aborts */
+	local_abt_enable();
+
 #else /* ifndef CONFIG_CPU_V7M */
 	/*
 	 * on V7-M there is no need to copy the vector table to a dedicated