ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled

Message ID	5375079E.3060305@ti.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> Message-ID: <5375079E.3060305@ti.com> Date: Thu, 15 May 2014 14:29:50 -0400 From: Santosh Shilimkar <santosh.shilimkar@ti.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: Daniel Lezcano <daniel.lezcano@linaro.org>, Alex Shi <alex.shi@linaro.org> Subject: Re: [PATCH] ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled References: <1399991966-14582-1-git-send-email-santosh.shilimkar@ti.com> <5373C78E.7040301@linaro.org> <5373C8E8.6040208@ti.com> <5373CBB9.40506@linaro.org> <5373DD94.8030208@ti.com> <5374F35E.5040900@ti.com> <5374FE61.50901@linaro.org> <5374FF5B.8050702@ti.com> In-Reply-To: <5374FF5B.8050702@ti.com> Cc: tony@atomide.com, linux-omap@vger.kernel.org, khilman@linaro.org, linux-arm-kernel@lists.infradead.org, Roger Quadros <rogerq@ti.com> Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Santosh Shilimkar May 15, 2014, 6:29 p.m. UTC

On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote:
> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote:
>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote:

[..]

>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does
>>>> its won registration where it can start the timer. The way it was before the
>>>> consolidation.
>>>>
>>>> Ofcourse if you have better fix, then great.
>>>>
>>> What is your suggestion. We *must* fix the regression asap. I think
>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED
>>> seems a good way forward.
>>>
>>> Do let me know.
>>
>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang.
>>
> The hang is definitely due to the bctimer not started. As I said, I assumed it was and
> then you corrected saying it is under the flag.
> 
>> I am not convinced the culprit is this code you are trying to revert.
>>
> fair enough. Thats why I said if you have an alternative fix thats great.
> 
For record, below is updated patch with bctimer started which
was missed in earlier version. I haven't tested it though.

Alex,
Please give a try with your test-case and see if you still see the hang.
Am just curious about your issue and hence the request..

Regards,
Santosh

From bb3b82cc5645b83bedf1343d03cc956f27f6fc83 Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar <santosh.shilimkar@ti.com>
Date: Mon, 12 May 2014 17:37:59 -0400
Subject: [PATCH] ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled

On OMAP4 panda board, there have been several bug reports about boot
hang and lock-ups with CPU_IDLE enabled. The root cause of the issue
is missing interrupts while in idle state. Commit cb7094e8 {cpuidle / omap4 :
use CPUIDLE_FLAG_TIMER_STOP flag} moved the broadcast notifiers to common
code for right reasons but on OMAP4 which suffers from a nasty ROM code
bug with GIC, commit ff999b8a {ARM: OMAP4460: Workaround for ROM bug ..},
we loose interrupts which leads to issues like lock-up, hangs etc.

Patch reverts commit cb7094 {cpuidle / omap4 : use CPUIDLE_FLAG_TIMER_STOP
flag} and 54769d6 {cpuidle: OMAP4: remove timer broadcast initialization} to
avoid the issue. With this change, OMAP4 panda boards, the mentioned
issues are getting fixed. We no longer loose interrupts which was the cause
of the regression.

Cc: Roger Quadros <rogerq@ti.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-tested-by: Roger Quadros <rogerq@ti.com>
Reported-tested-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
---
 arch/arm/mach-omap2/cpuidle44xx.c |   25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

Alex Shi May 16, 2014, 12:33 a.m. UTC | #1

On 05/16/2014 02:29 AM, Santosh Shilimkar wrote:
> Alex,
> Please give a try with your test-case and see if you still see the hang.
> Am just curious about your issue and hence the request..

there is no test case. I just patched your patches, then boot new
kernel, system quickly to hang without any serial port oops, I am using
pandaES/ubuntu.

Santosh Shilimkar May 16, 2014, 1:43 p.m. UTC | #2

Tony,

On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote:
> On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote:
>> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote:
>>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote:
> 
> [..]
> 
>>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does
>>>>> its won registration where it can start the timer. The way it was before the
>>>>> consolidation.
>>>>>
>>>>> Ofcourse if you have better fix, then great.
>>>>>
>>>> What is your suggestion. We *must* fix the regression asap. I think
>>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED
>>>> seems a good way forward.
>>>>
>>>> Do let me know.
>>>
>>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang.
>>>
>> The hang is definitely due to the bctimer not started. As I said, I assumed it was and
>> then you corrected saying it is under the flag.
>>
>>> I am not convinced the culprit is this code you are trying to revert.
>>>
>> fair enough. Thats why I said if you have an alternative fix thats great.
>>
> For record, below is updated patch with bctimer started which
> was missed in earlier version. I haven't tested it though.
> 
> Alex,
> Please give a try with your test-case and see if you still see the hang.
> Am just curious about your issue and hence the request..
> 
Alex tested below patch and he don't see the hang so the patch is
addressing the issue.

If Daniel works out an alternate fix to avoid reverts, that will be great
but if not, we should merge the below patch. I let you take call on it.

> 
> 
> From bb3b82cc5645b83bedf1343d03cc956f27f6fc83 Mon Sep 17 00:00:00 2001
> From: Santosh Shilimkar <santosh.shilimkar@ti.com>
> Date: Mon, 12 May 2014 17:37:59 -0400
> Subject: [PATCH] ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled
> 
> On OMAP4 panda board, there have been several bug reports about boot
> hang and lock-ups with CPU_IDLE enabled. The root cause of the issue
> is missing interrupts while in idle state. Commit cb7094e8 {cpuidle / omap4 :
> use CPUIDLE_FLAG_TIMER_STOP flag} moved the broadcast notifiers to common
> code for right reasons but on OMAP4 which suffers from a nasty ROM code
> bug with GIC, commit ff999b8a {ARM: OMAP4460: Workaround for ROM bug ..},
> we loose interrupts which leads to issues like lock-up, hangs etc.
> 
> Patch reverts commit cb7094 {cpuidle / omap4 : use CPUIDLE_FLAG_TIMER_STOP
> flag} and 54769d6 {cpuidle: OMAP4: remove timer broadcast initialization} to
> avoid the issue. With this change, OMAP4 panda boards, the mentioned
> issues are getting fixed. We no longer loose interrupts which was the cause
> of the regression.
> 
> Cc: Roger Quadros <rogerq@ti.com>
> Cc: Kevin Hilman <khilman@linaro.org>
> Cc: Tony Lindgren <tony@atomide.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Reported-tested-by: Roger Quadros <rogerq@ti.com>
> Reported-tested-by: Kevin Hilman <khilman@linaro.org>
> Tested-by: Tony Lindgren <tony@atomide.com>
> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
> ---
>  arch/arm/mach-omap2/cpuidle44xx.c |   25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/mach-omap2/cpuidle44xx.c b/arch/arm/mach-omap2/cpuidle44xx.c
> index 01fc710..2498ab0 100644
> --- a/arch/arm/mach-omap2/cpuidle44xx.c
> +++ b/arch/arm/mach-omap2/cpuidle44xx.c
> @@ -14,6 +14,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/cpu_pm.h>
>  #include <linux/export.h>
> +#include <linux/clockchips.h>
>  
>  #include <asm/cpuidle.h>
>  #include <asm/proc-fns.h>
> @@ -83,6 +84,7 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
>  {
>  	struct idle_statedata *cx = state_ptr + index;
>  	u32 mpuss_can_lose_context = 0;
> +	int cpu_id = smp_processor_id();
>  
>  	/*
>  	 * CPU0 has to wait and stay ON until CPU1 is OFF state.
> @@ -110,6 +112,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
>  	mpuss_can_lose_context = (cx->mpu_state == PWRDM_POWER_RET) &&
>  				 (cx->mpu_logic_state == PWRDM_POWER_OFF);
>  
> +	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu_id);
> +
>  	/*
>  	 * Call idle CPU PM enter notifier chain so that
>  	 * VFP and per CPU interrupt context is saved.
> @@ -165,6 +169,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
>  	if (dev->cpu == 0 && mpuss_can_lose_context)
>  		cpu_cluster_pm_exit();
>  
> +	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu_id);
> +
>  fail:
>  	cpuidle_coupled_parallel_barrier(dev, &abort_barrier);
>  	cpu_done[dev->cpu] = false;
> @@ -172,6 +178,16 @@ fail:
>  	return index;
>  }
>  
> +/*
> + * For each cpu, setup the broadcast timer because local timers
> + * stops for the states above C1.
> + */
> +static void omap_setup_broadcast_timer(void *arg)
> +{
> +	int cpu = smp_processor_id();
> +	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &cpu);
> +}
> +
>  static struct cpuidle_driver omap4_idle_driver = {
>  	.name				= "omap4_idle",
>  	.owner				= THIS_MODULE,
> @@ -189,8 +205,7 @@ static struct cpuidle_driver omap4_idle_driver = {
>  			/* C2 - CPU0 OFF + CPU1 OFF + MPU CSWR */
>  			.exit_latency = 328 + 440,
>  			.target_residency = 960,
> -			.flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED |
> -			         CPUIDLE_FLAG_TIMER_STOP,
> +			.flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED,
>  			.enter = omap_enter_idle_coupled,
>  			.name = "C2",
>  			.desc = "CPUx OFF, MPUSS CSWR",
> @@ -199,8 +214,7 @@ static struct cpuidle_driver omap4_idle_driver = {
>  			/* C3 - CPU0 OFF + CPU1 OFF + MPU OSWR */
>  			.exit_latency = 460 + 518,
>  			.target_residency = 1100,
> -			.flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED |
> -			         CPUIDLE_FLAG_TIMER_STOP,
> +			.flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED,
>  			.enter = omap_enter_idle_coupled,
>  			.name = "C3",
>  			.desc = "CPUx OFF, MPUSS OSWR",
> @@ -231,5 +245,8 @@ int __init omap4_idle_init(void)
>  	if (!cpu_clkdm[0] || !cpu_clkdm[1])
>  		return -ENODEV;
>  
> +	/* Configure the broadcast timer on each cpu */
> +	on_each_cpu(omap_setup_broadcast_timer, NULL, 1);
> +
>  	return cpuidle_register(&omap4_idle_driver, cpu_online_mask);
>  }
>

Tony Lindgren May 16, 2014, 9:29 p.m. UTC | #3

* Santosh Shilimkar <santosh.shilimkar@ti.com> [140516 06:43]:
> Tony,
> 
> On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote:
> > On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote:
> >> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote:
> >>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote:
> > 
> > [..]
> > 
> >>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does
> >>>>> its won registration where it can start the timer. The way it was before the
> >>>>> consolidation.
> >>>>>
> >>>>> Ofcourse if you have better fix, then great.
> >>>>>
> >>>> What is your suggestion. We *must* fix the regression asap. I think
> >>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED
> >>>> seems a good way forward.
> >>>>
> >>>> Do let me know.
> >>>
> >>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang.
> >>>
> >> The hang is definitely due to the bctimer not started. As I said, I assumed it was and
> >> then you corrected saying it is under the flag.
> >>
> >>> I am not convinced the culprit is this code you are trying to revert.
> >>>
> >> fair enough. Thats why I said if you have an alternative fix thats great.
> >>
> > For record, below is updated patch with bctimer started which
> > was missed in earlier version. I haven't tested it though.
> > 
> > Alex,
> > Please give a try with your test-case and see if you still see the hang.
> > Am just curious about your issue and hence the request..
> > 
> Alex tested below patch and he don't see the hang so the patch is
> addressing the issue.
> 
> If Daniel works out an alternate fix to avoid reverts, that will be great
> but if not, we should merge the below patch. I let you take call on it.
 
Daniel any news on this?

And just to recap, this problem can be reproduced with current
Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
system should hang during the boot at some point.

And for the record, the omap3 hang fix is now posted to the lists as
"[PATCH] ARM: OMAP2+: Fix DMA hang after off-idle". This should not
have anything to do with the omap4 cpu_idle hang as omap4 does not
currently lose context during idle.

Regards,

Tony

Daniel Lezcano May 19, 2014, 4:45 p.m. UTC | #4

On 05/16/2014 11:29 PM, Tony Lindgren wrote:
> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140516 06:43]:
>> Tony,
>>
>> On Thursday 15 May 2014 02:29 PM, Santosh Shilimkar wrote:
>>> On Thursday 15 May 2014 01:54 PM, Santosh Shilimkar wrote:
>>>> On Thursday 15 May 2014 01:50 PM, Daniel Lezcano wrote:
>>>>> On 05/15/2014 07:03 PM, Santosh Shilimkar wrote:
>>>
>>> [..]
>>>
>>>>>>> With above mentioned change, it should work. Other alternatives is OMAP4 driver does
>>>>>>> its won registration where it can start the timer. The way it was before the
>>>>>>> consolidation.
>>>>>>>
>>>>>>> Ofcourse if you have better fix, then great.
>>>>>>>
>>>>>> What is your suggestion. We *must* fix the regression asap. I think
>>>>>> $subject patch with an update to bctimer start under CPUIDLE_FLAG_COUPLED
>>>>>> seems a good way forward.
>>>>>>
>>>>>> Do let me know.
>>>>>
>>>>> Did you see Alex Shi's email [cc'ed] ? Reverting this change makes the panda ES to hang.
>>>>>
>>>> The hang is definitely due to the bctimer not started. As I said, I assumed it was and
>>>> then you corrected saying it is under the flag.
>>>>
>>>>> I am not convinced the culprit is this code you are trying to revert.
>>>>>
>>>> fair enough. Thats why I said if you have an alternative fix thats great.
>>>>
>>> For record, below is updated patch with bctimer started which
>>> was missed in earlier version. I haven't tested it though.
>>>
>>> Alex,
>>> Please give a try with your test-case and see if you still see the hang.
>>> Am just curious about your issue and hence the request..
>>>
>> Alex tested below patch and he don't see the hang so the patch is
>> addressing the issue.
>>
>> If Daniel works out an alternate fix to avoid reverts, that will be great
>> but if not, we should merge the below patch. I let you take call on it.
>
> Daniel any news on this?
>
> And just to recap, this problem can be reproduced with current
> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
> system should hang during the boot at some point.

I can take the time to investigate a bit more but not right now. What is 
your deadline before committing the reverts ?

> And for the record, the omap3 hang fix is now posted to the lists as
> "[PATCH] ARM: OMAP2+: Fix DMA hang after off-idle". This should not
> have anything to do with the omap4 cpu_idle hang as omap4 does not
> currently lose context during idle.

Tony Lindgren May 19, 2014, 5:23 p.m. UTC | #5

* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
> >
> >And just to recap, this problem can be reproduced with current
> >Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
> >system should hang during the boot at some point.
> 
> I can take the time to investigate a bit more but not right now. What is
> your deadline before committing the reverts ?

Well we do have several automated build and boot systems failing
because of this with multi_v7_defconfig. And users are complaining,
see this report from Tobias Jakobi:

https://bugzilla.kernel.org/show_bug.cgi?id=75421

It seems that doing the revert is not enough based on the
page above.

I'd prefer we'd fix this issue properly for sure, it seems that
we're not quite understanding what's going on. And this might
hit other platforms too when they start implementing deeper
PM idle states in the mainline kernel.

Regards,

Tony

Santosh Shilimkar May 19, 2014, 5:34 p.m. UTC | #6

On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
>> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
>>>
>>> And just to recap, this problem can be reproduced with current
>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
>>> system should hang during the boot at some point.
>>
>> I can take the time to investigate a bit more but not right now. What is
>> your deadline before committing the reverts ?
> 
> Well we do have several automated build and boot systems failing
> because of this with multi_v7_defconfig. And users are complaining,
> see this report from Tobias Jakobi:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=75421
> 
> It seems that doing the revert is not enough based on the
> page above.
>
Thats not true. The above link used the half patch and not the
updated patch. Updated patch worked for Alex also. As you can
see they saw RCU stalls and they go away after the updated patch.

Can you please point them to try out the updated patch ?
 
> I'd prefer we'd fix this issue properly for sure, it seems that
> we're not quite understanding what's going on. And this might
> hit other platforms too when they start implementing deeper
> PM idle states in the mainline kernel.
> 
I am certain that the updated patch fixed the regression
for sure. The issue is really not generic enough since its related
an OMAP ROM errata which needs that special handling of
interrupt re-trigger etc. You don't need that for other platforms
so they are not likely get affected.

Tony Lindgren May 19, 2014, 5:51 p.m. UTC | #7

* Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]:
> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
> > * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
> >> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
> >>>
> >>> And just to recap, this problem can be reproduced with current
> >>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
> >>> system should hang during the boot at some point.
> >>
> >> I can take the time to investigate a bit more but not right now. What is
> >> your deadline before committing the reverts ?
> > 
> > Well we do have several automated build and boot systems failing
> > because of this with multi_v7_defconfig. And users are complaining,
> > see this report from Tobias Jakobi:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=75421
> > 
> > It seems that doing the revert is not enough based on the
> > page above.
> >
> Thats not true. The above link used the half patch and not the
> updated patch. Updated patch worked for Alex also. As you can
> see they saw RCU stalls and they go away after the updated patch.
> 
> Can you please point them to try out the updated patch ?

OK good point. I added a link to the updated patch in
bugzilla.
  
> > I'd prefer we'd fix this issue properly for sure, it seems that
> > we're not quite understanding what's going on. And this might
> > hit other platforms too when they start implementing deeper
> > PM idle states in the mainline kernel.
> > 
> I am certain that the updated patch fixed the regression
> for sure. The issue is really not generic enough since its related
> an OMAP ROM errata which needs that special handling of
> interrupt re-trigger etc. You don't need that for other platforms
> so they are not likely get affected.

OK makes sense to me considering the ROM code. Daniel, are you OK
with that or do you still want to investigate further?

Regards,

Tony

Daniel Lezcano May 19, 2014, 6:06 p.m. UTC | #8

On 05/19/2014 07:51 PM, Tony Lindgren wrote:
> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]:
>> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
>>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
>>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
>>>>>
>>>>> And just to recap, this problem can be reproduced with current
>>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
>>>>> system should hang during the boot at some point.
>>>>
>>>> I can take the time to investigate a bit more but not right now. What is
>>>> your deadline before committing the reverts ?
>>>
>>> Well we do have several automated build and boot systems failing
>>> because of this with multi_v7_defconfig. And users are complaining,
>>> see this report from Tobias Jakobi:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=75421
>>>
>>> It seems that doing the revert is not enough based on the
>>> page above.
>>>
>> Thats not true. The above link used the half patch and not the
>> updated patch. Updated patch worked for Alex also. As you can
>> see they saw RCU stalls and they go away after the updated patch.
>>
>> Can you please point them to try out the updated patch ?
>
> OK good point. I added a link to the updated patch in
> bugzilla.
>
>>> I'd prefer we'd fix this issue properly for sure, it seems that
>>> we're not quite understanding what's going on. And this might
>>> hit other platforms too when they start implementing deeper
>>> PM idle states in the mainline kernel.
>>>
>> I am certain that the updated patch fixed the regression
>> for sure. The issue is really not generic enough since its related
>> an OMAP ROM errata which needs that special handling of
>> interrupt re-trigger etc. You don't need that for other platforms
>> so they are not likely get affected.
>
> OK makes sense to me considering the ROM code. Daniel, are you OK
> with that or do you still want to investigate further?

For the moment I am a bit short in time for some other tasks. So feel 
free to apply the revert and I will look for a proper fix when I will 
have time.

Thanks
   -- Daniel

Tony Lindgren May 19, 2014, 7:36 p.m. UTC | #9

* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]:
> On 05/19/2014 07:51 PM, Tony Lindgren wrote:
> >* Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]:
> >>On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
> >>>* Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
> >>>>On 05/16/2014 11:29 PM, Tony Lindgren wrote:
> >>>>>
> >>>>>And just to recap, this problem can be reproduced with current
> >>>>>Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
> >>>>>system should hang during the boot at some point.
> >>>>
> >>>>I can take the time to investigate a bit more but not right now. What is
> >>>>your deadline before committing the reverts ?
> >>>
> >>>Well we do have several automated build and boot systems failing
> >>>because of this with multi_v7_defconfig. And users are complaining,
> >>>see this report from Tobias Jakobi:
> >>>
> >>>https://bugzilla.kernel.org/show_bug.cgi?id=75421
> >>>
> >>>It seems that doing the revert is not enough based on the
> >>>page above.
> >>>
> >>Thats not true. The above link used the half patch and not the
> >>updated patch. Updated patch worked for Alex also. As you can
> >>see they saw RCU stalls and they go away after the updated patch.
> >>
> >>Can you please point them to try out the updated patch ?
> >
> >OK good point. I added a link to the updated patch in
> >bugzilla.
> >
> >>>I'd prefer we'd fix this issue properly for sure, it seems that
> >>>we're not quite understanding what's going on. And this might
> >>>hit other platforms too when they start implementing deeper
> >>>PM idle states in the mainline kernel.
> >>>
> >>I am certain that the updated patch fixed the regression
> >>for sure. The issue is really not generic enough since its related
> >>an OMAP ROM errata which needs that special handling of
> >>interrupt re-trigger etc. You don't need that for other platforms
> >>so they are not likely get affected.
> >
> >OK makes sense to me considering the ROM code. Daniel, are you OK
> >with that or do you still want to investigate further?
> 
> For the moment I am a bit short in time for some other tasks. So feel free
> to apply the revert and I will look for a proper fix when I will have time.

Added Tobias to Cc. At the bugzilla link Tobias is saying
he used the right patch from Santosh to test and it still 
fails.

Regards,

Tony

Daniel Lezcano May 19, 2014, 7:45 p.m. UTC | #10

On 05/19/2014 09:36 PM, Tony Lindgren wrote:
> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]:
>> On 05/19/2014 07:51 PM, Tony Lindgren wrote:
>>> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]:
>>>> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
>>>>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
>>>>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
>>>>>>>
>>>>>>> And just to recap, this problem can be reproduced with current
>>>>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
>>>>>>> system should hang during the boot at some point.
>>>>>>
>>>>>> I can take the time to investigate a bit more but not right now. What is
>>>>>> your deadline before committing the reverts ?
>>>>>
>>>>> Well we do have several automated build and boot systems failing
>>>>> because of this with multi_v7_defconfig. And users are complaining,
>>>>> see this report from Tobias Jakobi:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=75421
>>>>>
>>>>> It seems that doing the revert is not enough based on the
>>>>> page above.
>>>>>
>>>> Thats not true. The above link used the half patch and not the
>>>> updated patch. Updated patch worked for Alex also. As you can
>>>> see they saw RCU stalls and they go away after the updated patch.
>>>>
>>>> Can you please point them to try out the updated patch ?
>>>
>>> OK good point. I added a link to the updated patch in
>>> bugzilla.
>>>
>>>>> I'd prefer we'd fix this issue properly for sure, it seems that
>>>>> we're not quite understanding what's going on. And this might
>>>>> hit other platforms too when they start implementing deeper
>>>>> PM idle states in the mainline kernel.
>>>>>
>>>> I am certain that the updated patch fixed the regression
>>>> for sure. The issue is really not generic enough since its related
>>>> an OMAP ROM errata which needs that special handling of
>>>> interrupt re-trigger etc. You don't need that for other platforms
>>>> so they are not likely get affected.
>>>
>>> OK makes sense to me considering the ROM code. Daniel, are you OK
>>> with that or do you still want to investigate further?
>>
>> For the moment I am a bit short in time for some other tasks. So feel free
>> to apply the revert and I will look for a proper fix when I will have time.
>
> Added Tobias to Cc. At the bugzilla link Tobias is saying
> he used the right patch from Santosh to test and it still
> fails.

So apparently Santosh, Kevin and Alex say they are not facing the issue 
anymore with the revert. But Tobias is still facing this issue.

There is this simple program [1] which usually makes cpuidle to hang the 
system more quickly when there is a problem somewhere. May be worth to 
check with it.

Hope that helps.

   -- Daniel

[1] 
https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c

Santosh Shilimkar May 19, 2014, 8 p.m. UTC | #11

On Monday 19 May 2014 03:36 PM, Tony Lindgren wrote:
> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 11:07]:
>> On 05/19/2014 07:51 PM, Tony Lindgren wrote:
>>> * Santosh Shilimkar <santosh.shilimkar@ti.com> [140519 10:35]:
>>>> On Monday 19 May 2014 01:23 PM, Tony Lindgren wrote:
>>>>> * Daniel Lezcano <daniel.lezcano@linaro.org> [140519 09:46]:
>>>>>> On 05/16/2014 11:29 PM, Tony Lindgren wrote:
>>>>>>>
>>>>>>> And just to recap, this problem can be reproduced with current
>>>>>>> Linux next with omap2plus_defconfig with CONFIG_CPU_IDLE enabled. The
>>>>>>> system should hang during the boot at some point.
>>>>>>
>>>>>> I can take the time to investigate a bit more but not right now. What is
>>>>>> your deadline before committing the reverts ?
>>>>>
>>>>> Well we do have several automated build and boot systems failing
>>>>> because of this with multi_v7_defconfig. And users are complaining,
>>>>> see this report from Tobias Jakobi:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=75421
>>>>>
>>>>> It seems that doing the revert is not enough based on the
>>>>> page above.
>>>>>
>>>> Thats not true. The above link used the half patch and not the
>>>> updated patch. Updated patch worked for Alex also. As you can
>>>> see they saw RCU stalls and they go away after the updated patch.
>>>>
>>>> Can you please point them to try out the updated patch ?
>>>
>>> OK good point. I added a link to the updated patch in
>>> bugzilla.
>>>
>>>>> I'd prefer we'd fix this issue properly for sure, it seems that
>>>>> we're not quite understanding what's going on. And this might
>>>>> hit other platforms too when they start implementing deeper
>>>>> PM idle states in the mainline kernel.
>>>>>
>>>> I am certain that the updated patch fixed the regression
>>>> for sure. The issue is really not generic enough since its related
>>>> an OMAP ROM errata which needs that special handling of
>>>> interrupt re-trigger etc. You don't need that for other platforms
>>>> so they are not likely get affected.
>>>
>>> OK makes sense to me considering the ROM code. Daniel, are you OK
>>> with that or do you still want to investigate further?
>>
>> For the moment I am a bit short in time for some other tasks. So feel free
>> to apply the revert and I will look for a proper fix when I will have time.
> 
> Added Tobias to Cc. At the bugzilla link Tobias is saying
> he used the right patch from Santosh to test and it still 
> fails.
> 
The link of the patch in bugzilla shows that it was not the updated
patch and that was my reference point. The other bugzilla issue seems to
be different and mostly not related to cpuidle. I could be wrong.

----------------------

dmesg | grep idle:
cpuidle: using governor ladder
cpuidle: using governor menu

grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*:
(now attached as 'cpuidle states')

The 'BUG' looks like this:
BUG: scheduling while atomic: swapper/1/0/0xffff0000
Modules linked in: ctr ccm btrfs xor xor_neon zlib_inflate zlib_deflate omapfb cfbfillrect cfbimgblt cfbcopyarea raid6_pq arc4 wl12xx wlcore mac80211 cfg80211 rfkill usb_storage snd_soc_omap_hdmi snd_soc_omap_hdmi_card wlcore_sdio
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W     3.15.0-rc5+ #1
Backtrace: 
[<c001152c>] (dump_backtrace) from [<c00116c8>] (show_stack+0x18/0x1c)
 r6:eb894000 r5:00000001 r4:00000000 r3:00000000
[<c00116b0>] (show_stack) from [<c0430a94>] (dump_stack+0x7c/0x98)
[<c0430a18>] (dump_stack) from [<c042ec10>] (__schedule_bug+0x48/0x60)
 r4:00000000 r3:00000153
[<c042ebc8>] (__schedule_bug) from [<c0432248>] (__schedule+0x41c/0x4b4)
 r4:2b9fe000 r3:00000000
[<c0431e2c>] (__schedule) from [<c04323d4>] (schedule+0x38/0x88)
 r10:eb894000 r9:eb894000 r8:00000000 r7:eb894000 r6:c05b0450 r5:c05b04a0
 r4:eb894000
[<c043239c>] (schedule) from [<c04326fc>] (schedule_preempt_disabled+0x10/0x14)
[<c04326ec>] (schedule_preempt_disabled) from [<c006d8f0>] (cpu_startup_entry+0xec/0x22c)
[<c006d804>] (cpu_startup_entry) from [<c00131a4>] (secondary_start_kernel+0xe8/0x100)
 r7:c05de240
[<c00130bc>] (secondary_start_kernel) from [<80008664>] (0x80008664)
 r4:ab87c06a r3:c000864c
----------------------

Tobias Jakobi May 19, 2014, 9:18 p.m. UTC | #12

Daniel Lezcano wrote:
>
> So apparently Santosh, Kevin and Alex say they are not facing the
> issue anymore with the revert. But Tobias is still facing this issue.
>
> There is this simple program [1] which usually makes cpuidle to hang
> the system more quickly when there is a problem somewhere. May be
> worth to check with it.
>
> Hope that helps.
>
>   -- Daniel
>
> [1]
> https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c
>
Hello,

just to clarify. I never encountered the boot failure problem on my
Panda ES in the first place. The issue with CPU_IDLE for me, like I also
say in the bugreport, showed itself through instability of the USB. I
saw the boot issue on the ml, read that it was caused by CPU_IDLE, and
thought I should make Tony aware that I have a similar problem.

Now, after applying the latest version of the patch that Santosh
proposed, the situation with CPU_IDLE improves. But it is still not
stable for me, since after some while, working with an SSH session of
WiFi (which IIRC is not connected via USB on the board), this 'BUG'
storm hits. This does _not_ happen when I disable CPU_IDLE completly.

But even if I don't connect via WiFi at all, just boot and let me system
run with serial console connected, after some time I get a kernel 'WARNING':
http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log

So something with CPU_IDLE is not right here. However I can't say if
it's exactly this boot issue which this thread was initially about.
Maybe seperate issues are at work here, who knows.

Anyway, I'm going to have a look at that cpuidle killer thingy later.

-- Tobias

Tony Lindgren May 19, 2014, 10:42 p.m. UTC | #13

* Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]:
> Daniel Lezcano wrote:
> >
> > So apparently Santosh, Kevin and Alex say they are not facing the
> > issue anymore with the revert. But Tobias is still facing this issue.
> >
> > There is this simple program [1] which usually makes cpuidle to hang
> > the system more quickly when there is a problem somewhere. May be
> > worth to check with it.
> >
> > Hope that helps.
> >
> >   -- Daniel
> >
> > [1]
> > https://git.linaro.org/power/pm-qa.git/blob/HEAD:/cpuidle/cpuidle_killer.c
> >
> Hello,
> 
> just to clarify. I never encountered the boot failure problem on my
> Panda ES in the first place. The issue with CPU_IDLE for me, like I also
> say in the bugreport, showed itself through instability of the USB. I
> saw the boot issue on the ml, read that it was caused by CPU_IDLE, and
> thought I should make Tony aware that I have a similar problem.
> 
> Now, after applying the latest version of the patch that Santosh
> proposed, the situation with CPU_IDLE improves. But it is still not
> stable for me, since after some while, working with an SSH session of
> WiFi (which IIRC is not connected via USB on the board), this 'BUG'
> storm hits. This does _not_ happen when I disable CPU_IDLE completly.

OK that seems like a different problem.
 
> But even if I don't connect via WiFi at all, just boot and let me system
> run with serial console connected, after some time I get a kernel 'WARNING':
> http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log
> 
> So something with CPU_IDLE is not right here. However I can't say if
> it's exactly this boot issue which this thread was initially about.
> Maybe seperate issues are at work here, who knows.
> 
> Anyway, I'm going to have a look at that cpuidle killer thingy later.

OK. I gave it a try, but did not get any errors at least after about
30 minutes of testing on my panda es.

I'll apply the patch from Santosh as that gets things booting again
with linux next.

Regards,

Tony

Tony Lindgren May 23, 2014, 2:44 p.m. UTC | #14

* Tobias Jakobi <tjakobi@math.uni-bielefeld.de> [140519 14:19]:
> 
> But even if I don't connect via WiFi at all, just boot and let me system
> run with serial console connected, after some time I get a kernel 'WARNING':
> http://www.math.uni-bielefeld.de/~tjakobi/archive/dmesg.1.log

BTW, care to update the bugzilla page with the second warning
in this log?

That's the WARNING: CPU: 1 PID: 0 at kernel/timer.c:1147 that's
at 238 seconds.

Regards,

Tony

ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled

Commit Message

Comments

Patch