diff mbox

Crash after 'reboot' due to 9be4fd2c7723a

Message ID 2148724.oQlEHos3jS@vostro.rjw.lan (mailing list archive)
State New, archived
Headers show

Commit Message

Rafael J. Wysocki May 21, 2016, 12:56 a.m. UTC
On Friday, May 20, 2016 09:36:52 PM Fabio Estevam wrote:
> On Fri, May 20, 2016 at 9:30 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> 
> > What exactly do you mean by reverting here?  It surely cannot be
> > reverted by itself and there are many things that depend on it.
> 
> Yes, it cannot be reverted cleanly on 4.6.
> 
> If I do 'git checkout 9be4fd2c7723a30' and then 'git revert
> 9be4fd2c7723a30' then the reboot command works.

OK, so in fact you need to go all the way to the parent of 9be4fd2c7723a30.

> >>> Is this an SMP board, actually?
> >>
> >> Yes, this is a single core CortexA7.
> >
> > Hmm.  Single core means non-SMP rather?
> 
> CONFIG_SMP can be set or not in the kernel config. With CONFIG_SMP=n
> the issue does not happen.

Yes, you can run an SMP kernel on a non-SMP board too, which is what leads to
the problem you're seeing.

If I'm not mistaken, the patch below should allow irq_work to make forward
progress for you, so please check if it makes any difference.

The root of the problem seems to be arch_irq_work_raise() and specifically
the __smp_cross_call function that appears to have problems.

I bet that is_smp() should return false on your platform, but it returns
true for some reason.

---
 drivers/cpufreq/cpufreq_governor.c |    1 +
 1 file changed, 1 insertion(+)

Comments

Fabio Estevam May 21, 2016, 1:19 a.m. UTC | #1
Hi Rafael,

On Fri, May 20, 2016 at 9:56 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> If I'm not mistaken, the patch below should allow irq_work to make forward
> progress for you, so please check if it makes any difference.
....

> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -362,6 +362,7 @@ static struct policy_dbs_info *alloc_pol
>         mutex_init(&policy_dbs->timer_mutex);
>         atomic_set(&policy_dbs->work_count, 0);
>         init_irq_work(&policy_dbs->irq_work, dbs_irq_work);
> +       policy_dbs->irq_work.flags = IRQ_WORK_LAZY;
>         INIT_WORK(&policy_dbs->work, dbs_work_handler);
>
>         /* Set policy_dbs for all CPUs, online+offline */

Running a 4.6 kernel with this patch applied allows me to reboot the
system, thanks.

'echo performance >
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' works fine now.
Fabio Estevam May 21, 2016, 1:22 a.m. UTC | #2
On Fri, May 20, 2016 at 10:19 PM, Fabio Estevam <festevam@gmail.com> wrote:

> Running a 4.6 kernel with this patch applied allows me to reboot the
> system, thanks.
>
> 'echo performance >
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' works fine now.

and as a bonus point suspend/resume works fine now too :-)
Rafael J. Wysocki May 21, 2016, 1:31 a.m. UTC | #3
On Sat, May 21, 2016 at 3:22 AM, Fabio Estevam <festevam@gmail.com> wrote:
> On Fri, May 20, 2016 at 10:19 PM, Fabio Estevam <festevam@gmail.com> wrote:
>
>> Running a 4.6 kernel with this patch applied allows me to reboot the
>> system, thanks.
>>
>> 'echo performance >
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' works fine now.
>
> and as a bonus point suspend/resume works fine now too :-)

OK, thanks for the confirmation!

The patch is just a workaround for a problem in the arch code showing
up on your platform, though.
Fabio Estevam May 23, 2016, 5:13 p.m. UTC | #4
On Fri, May 20, 2016 at 10:31 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:

> The patch is just a workaround for a problem in the arch code showing
> up on your platform, though.

Not sure where the problem in the arch code is located though.

Ping, Shawn, Sascha, any ideas?

Thanks
Russell King (Oracle) May 23, 2016, 6:28 p.m. UTC | #5
On Sat, May 21, 2016 at 02:56:20AM +0200, Rafael J. Wysocki wrote:
> The root of the problem seems to be arch_irq_work_raise() and specifically
> the __smp_cross_call function that appears to have problems.

We've been through this before.  The bottom line is that on ARM, there
is major wide-spread understanding that we want to be able to run a
kernel compiled for SMP on uniprocessor hardware.

This is something that's been going for years, and has worked fine for
years.

Now, someone has introduced this irq work stuff.  Great.  But they
haven't considered that people want to be able to run a SMP kernel
on UP hardware which _may_ have no hardware present which is capable
of raising IPIs.

I don't know what the situation is with the platform concerned here,
I don't know whether it uses the GIC (presumably, because it doesn't
NULL-pointer ref on calling smp_cross_call(), it is using the GIC.)
If so, the GIC isn't delivering the IPI because the IPI hardware is
missing.

Now, whether we could detect whether the GIC is IPI capable, I've
no idea, I don't have such a platform I could run tests on.  People
don't give me hardware anymore now that arm-soc is split off from
core ARM maintanence - it all goes into Linaro build farms.  So I'm
powerless to help here.

My attitude towards this is that it's a core kernel problem: the
core kernel is assuming that it can raise IPIs even on non-SMP
capable hardware.  Much non-SMP hardware doesn't have that ability.
While folk can try to push it into arch code, my feeling is that it
really needs to have a generic solution, even if it's some generic
solution that architectures in this situation can plug into.
Rafael J. Wysocki May 23, 2016, 8:16 p.m. UTC | #6
On Mon, May 23, 2016 at 8:28 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Sat, May 21, 2016 at 02:56:20AM +0200, Rafael J. Wysocki wrote:
>> The root of the problem seems to be arch_irq_work_raise() and specifically
>> the __smp_cross_call function that appears to have problems.
>
> We've been through this before.  The bottom line is that on ARM, there
> is major wide-spread understanding that we want to be able to run a
> kernel compiled for SMP on uniprocessor hardware.
>
> This is something that's been going for years, and has worked fine for
> years.
>
> Now, someone has introduced this irq work stuff.  Great.  But they
> haven't considered that people want to be able to run a SMP kernel
> on UP hardware which _may_ have no hardware present which is capable
> of raising IPIs.
>
> I don't know what the situation is with the platform concerned here,
> I don't know whether it uses the GIC (presumably, because it doesn't
> NULL-pointer ref on calling smp_cross_call(), it is using the GIC.)
> If so, the GIC isn't delivering the IPI because the IPI hardware is
> missing.
>
> Now, whether we could detect whether the GIC is IPI capable, I've
> no idea, I don't have such a platform I could run tests on.  People
> don't give me hardware anymore now that arm-soc is split off from
> core ARM maintanence - it all goes into Linaro build farms.  So I'm
> powerless to help here.
>
> My attitude towards this is that it's a core kernel problem: the
> core kernel is assuming that it can raise IPIs even on non-SMP
> capable hardware.  Much non-SMP hardware doesn't have that ability.
> While folk can try to push it into arch code, my feeling is that it
> really needs to have a generic solution, even if it's some generic
> solution that architectures in this situation can plug into.

Well, this particular case isn't about IPIs at all.

irq_work_queue() can work without IPIs and it works like that on other
ARM platforms where SMP kernels are run on UP hardware (without IPI
support).

Something seems to be missing in kernel config or similar here, but I
can't really say what it is right away.
Rafael J. Wysocki May 23, 2016, 8:23 p.m. UTC | #7
On Mon, May 23, 2016 at 10:16 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Mon, May 23, 2016 at 8:28 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
>> On Sat, May 21, 2016 at 02:56:20AM +0200, Rafael J. Wysocki wrote:
>>> The root of the problem seems to be arch_irq_work_raise() and specifically
>>> the __smp_cross_call function that appears to have problems.
>>
>> We've been through this before.  The bottom line is that on ARM, there
>> is major wide-spread understanding that we want to be able to run a
>> kernel compiled for SMP on uniprocessor hardware.
>>
>> This is something that's been going for years, and has worked fine for
>> years.
>>
>> Now, someone has introduced this irq work stuff.  Great.  But they
>> haven't considered that people want to be able to run a SMP kernel
>> on UP hardware which _may_ have no hardware present which is capable
>> of raising IPIs.
>>
>> I don't know what the situation is with the platform concerned here,
>> I don't know whether it uses the GIC (presumably, because it doesn't
>> NULL-pointer ref on calling smp_cross_call(), it is using the GIC.)
>> If so, the GIC isn't delivering the IPI because the IPI hardware is
>> missing.
>>
>> Now, whether we could detect whether the GIC is IPI capable, I've
>> no idea, I don't have such a platform I could run tests on.  People
>> don't give me hardware anymore now that arm-soc is split off from
>> core ARM maintanence - it all goes into Linaro build farms.  So I'm
>> powerless to help here.
>>
>> My attitude towards this is that it's a core kernel problem: the
>> core kernel is assuming that it can raise IPIs even on non-SMP
>> capable hardware.  Much non-SMP hardware doesn't have that ability.
>> While folk can try to push it into arch code, my feeling is that it
>> really needs to have a generic solution, even if it's some generic
>> solution that architectures in this situation can plug into.
>
> Well, this particular case isn't about IPIs at all.
>
> irq_work_queue() can work without IPIs and it works like that on other
> ARM platforms where SMP kernels are run on UP hardware (without IPI
> support).
>
> Something seems to be missing in kernel config or similar here, but I
> can't really say what it is right away.

For one, if that platform is not capable of raising interrupts for IRQ
works, I'm not sure why arch_irq_work_has_interrupt() returns true on
it.
Fabio Estevam May 23, 2016, 8:59 p.m. UTC | #8
On Mon, May 23, 2016 at 5:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:

> For one, if that platform is not capable of raising interrupts for IRQ
> works, I'm not sure why arch_irq_work_has_interrupt() returns true on
> it.

It returns true because the kernel was built with CONFIG_SMP=y.

On ARM we have:

static inline bool arch_irq_work_has_interrupt(void)
{
    return is_smp();
}

and then:

static inline bool is_smp(void)
{
#ifndef CONFIG_SMP
    return false;
#elif defined(CONFIG_SMP_ON_UP)
    extern unsigned int smp_on_up;
    return !!smp_on_up;
#else
    return true;
#endif
}

So if CONFIG_SMP=y,  is_smp() returns 1.

As I mentioned before, the reboot problem does not happen with CONFIG_SMP=n.
Rafael J. Wysocki May 23, 2016, 9:03 p.m. UTC | #9
On Mon, May 23, 2016 at 10:59 PM, Fabio Estevam <festevam@gmail.com> wrote:
> On Mon, May 23, 2016 at 5:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
>> For one, if that platform is not capable of raising interrupts for IRQ
>> works, I'm not sure why arch_irq_work_has_interrupt() returns true on
>> it.
>
> It returns true because the kernel was built with CONFIG_SMP=y.
>
> On ARM we have:
>
> static inline bool arch_irq_work_has_interrupt(void)
> {
>     return is_smp();
> }
>
> and then:
>
> static inline bool is_smp(void)
> {
> #ifndef CONFIG_SMP
>     return false;
> #elif defined(CONFIG_SMP_ON_UP)
>     extern unsigned int smp_on_up;
>     return !!smp_on_up;
> #else
>     return true;
> #endif
> }
>
> So if CONFIG_SMP=y,  is_smp() returns 1.
>
> As I mentioned before, the reboot problem does not happen with CONFIG_SMP=n.

First, why don't the other ARM UP platforms have this problem when SMP
kernels are run on them?

Second, quite evidently, the platform says "I can raise interrupts for
IRQ works", but then it doesn't do that.  That doesn't seem
particularly consistent to me ...
Russell King (Oracle) May 23, 2016, 9:07 p.m. UTC | #10
On Mon, May 23, 2016 at 05:59:37PM -0300, Fabio Estevam wrote:
> On Mon, May 23, 2016 at 5:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > For one, if that platform is not capable of raising interrupts for IRQ
> > works, I'm not sure why arch_irq_work_has_interrupt() returns true on
> > it.
> 
> It returns true because the kernel was built with CONFIG_SMP=y.
> 
> On ARM we have:
> 
> static inline bool arch_irq_work_has_interrupt(void)
> {
>     return is_smp();
> }
> 
> and then:
> 
> static inline bool is_smp(void)
> {
> #ifndef CONFIG_SMP
>     return false;
> #elif defined(CONFIG_SMP_ON_UP)
>     extern unsigned int smp_on_up;
>     return !!smp_on_up;
> #else
>     return true;
> #endif
> }
> 
> So if CONFIG_SMP=y,  is_smp() returns 1.
> 
> As I mentioned before, the reboot problem does not happen with CONFIG_SMP=n.

Read the code again.  There's three states:

1. CONFIG_SMP=n - it always returns false.
2. CONFIG_SMP=y, CONFIG_SMP_ON_UP=n - it always returns true.  Such a
   kernel can _only_ be run on SMP capable systems.
3. CONFIG_SMP=y, CONFIG_SMP_ON_UP=y - it returns depending on whether
   we detect that the CPU is SMP capable.
Russell King (Oracle) May 23, 2016, 9:11 p.m. UTC | #11
On Mon, May 23, 2016 at 11:03:01PM +0200, Rafael J. Wysocki wrote:
> On Mon, May 23, 2016 at 10:59 PM, Fabio Estevam <festevam@gmail.com> wrote:
> > On ARM we have:
> >
> > static inline bool arch_irq_work_has_interrupt(void)
> > {
> >     return is_smp();
> > }
> >
> > and then:
> >
> > static inline bool is_smp(void)
> > {
> > #ifndef CONFIG_SMP
> >     return false;
> > #elif defined(CONFIG_SMP_ON_UP)
> >     extern unsigned int smp_on_up;
> >     return !!smp_on_up;
> > #else
> >     return true;
> > #endif
> > }
> >
> > So if CONFIG_SMP=y,  is_smp() returns 1.
> >
> > As I mentioned before, the reboot problem does not happen with CONFIG_SMP=n.
> 
> First, why don't the other ARM UP platforms have this problem when SMP
> kernels are run on them?
> 
> Second, quite evidently, the platform says "I can raise interrupts for
> IRQ works", but then it doesn't do that.  That doesn't seem
> particularly consistent to me ...

Maybe someone has implemented a SoC which has a CPU capable of SMP
but the GIC isn't...  I'm just guessing again.
Fabio Estevam May 23, 2016, 9:38 p.m. UTC | #12
On Mon, May 23, 2016 at 6:07 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:

> Read the code again.  There's three states:
>
> 1. CONFIG_SMP=n - it always returns false.
> 2. CONFIG_SMP=y, CONFIG_SMP_ON_UP=n - it always returns true.  Such a
>    kernel can _only_ be run on SMP capable systems.
> 3. CONFIG_SMP=y, CONFIG_SMP_ON_UP=y - it returns depending on whether
>    we detect that the CPU is SMP capable.

Yes, I meant that on imx6ul we have option 2 above with imx_v6_v7_defconfig.

Also tested option 3 and is_smp() returns true as well.

I will try to see why GIC on this SoC cannot work in SMP.
Russell King (Oracle) May 23, 2016, 10:05 p.m. UTC | #13
On Mon, May 23, 2016 at 06:38:31PM -0300, Fabio Estevam wrote:
> On Mon, May 23, 2016 at 6:07 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> 
> > Read the code again.  There's three states:
> >
> > 1. CONFIG_SMP=n - it always returns false.
> > 2. CONFIG_SMP=y, CONFIG_SMP_ON_UP=n - it always returns true.  Such a
> >    kernel can _only_ be run on SMP capable systems.
> > 3. CONFIG_SMP=y, CONFIG_SMP_ON_UP=y - it returns depending on whether
> >    we detect that the CPU is SMP capable.
> 
> Yes, I meant that on imx6ul we have option 2 above with imx_v6_v7_defconfig.

I don't think so, unless you specifically turn off CONFIG_SMP_ON_UP.
It defaults to 'y', so imx_v6_v7_defconfig should result in this being
'y'.

> Also tested option 3 and is_smp() returns true as well.

So that means the MPIDR register is telling the kernel...

* The boot CPU has multiprocessor extensions.
* The boot CPU is not part of a uniprocessor system.

> I will try to see why GIC on this SoC cannot work in SMP.
Fabio Estevam May 23, 2016, 10:25 p.m. UTC | #14
On Mon, May 23, 2016 at 7:05 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:

> I don't think so, unless you specifically turn off CONFIG_SMP_ON_UP.
> It defaults to 'y', so imx_v6_v7_defconfig should result in this being
> 'y'.

Sorry, you are right: option 3 (CONFIG_SMP=y and CONFIG_SMP_ON_UP=y)
is the default one with imx_v6_v7_defconfig.
diff mbox

Patch

Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -362,6 +362,7 @@  static struct policy_dbs_info *alloc_pol
 	mutex_init(&policy_dbs->timer_mutex);
 	atomic_set(&policy_dbs->work_count, 0);
 	init_irq_work(&policy_dbs->irq_work, dbs_irq_work);
+	policy_dbs->irq_work.flags = IRQ_WORK_LAZY;
 	INIT_WORK(&policy_dbs->work, dbs_work_handler);
 
 	/* Set policy_dbs for all CPUs, online+offline */