diff mbox

ARM: tegra: disable nonboot CPUs when reboot

Message ID 1370597810-1153-1-git-send-email-josephl@nvidia.com (mailing list archive)
State New, archived
Headers show

Commit Message

Joseph Lo June 7, 2013, 9:36 a.m. UTC
The normal CPU hotplug flow in kernel and the flow for Tegra we expected,
is checking the CPU ID is OK for hotplug by "tegra_cpu_disable", the CPU
that would be hotplugged runs into a power-gate state by "tegra_cpu_die",
then the other CPU waits for the CPU that was hotplugged in reset and
clock gate it by "tegra_cpu_kill". That means we don't support the CPU
being stopped or put into offline by trigger "tegra_cpu_kill" directly.
It may cause a busy loop for waiting CPU in reset.

After the commit "62e930e reboot: rigrate shutdown/reboot to boot cpu",
we remove "disable_nonboot_cpus" when kernel_{restart,halt,power_off}.
But the ARM kernel trigger "send_smp_stop" when machine_shutdown, that
would cause the "tegra_cpu_kill" directly without "tegra_cpu_die" first.

We hook "disable_nonboot_cpus" in "reboot_notifier" to avoid that happens.
And it can work for reboot, shutdown, halt and kexec.

Signed-off-by: Joseph Lo <josephl@nvidia.com>
---
 arch/arm/mach-tegra/hotplug.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Stephen Warren June 7, 2013, 4:44 p.m. UTC | #1
On 06/07/2013 03:36 AM, Joseph Lo wrote:
> The normal CPU hotplug flow in kernel and the flow for Tegra we expected,
> is checking the CPU ID is OK for hotplug by "tegra_cpu_disable", the CPU
> that would be hotplugged runs into a power-gate state by "tegra_cpu_die",
> then the other CPU waits for the CPU that was hotplugged in reset and
> clock gate it by "tegra_cpu_kill". That means we don't support the CPU
> being stopped or put into offline by trigger "tegra_cpu_kill" directly.
> It may cause a busy loop for waiting CPU in reset.
> 
> After the commit "62e930e reboot: rigrate shutdown/reboot to boot cpu",
> we remove "disable_nonboot_cpus" when kernel_{restart,halt,power_off}.
> But the ARM kernel trigger "send_smp_stop" when machine_shutdown, that
> would cause the "tegra_cpu_kill" directly without "tegra_cpu_die" first.
> 
> We hook "disable_nonboot_cpus" in "reboot_notifier" to avoid that happens.
> And it can work for reboot, shutdown, halt and kexec.

I don't believe this is the correct solution.

If the semantics of cpu_kill/cpu_die are such that it's legal to call
only cpu_kill without having cause cpu_die to run on the killed CPU
first, then Tegra's implementation is buggy. We should simply fix that,
rather than avoiding this by forcing a different order for the calls to
cpu_kill/cpu_die.

If the semantics of cpu_kill/cpu_die are such that one /must/ cause
cpu_die to run on the killed CPU before cpu_kill can be used on it, then
there's a bug in the code that isn't doing that.

I'm CCing a few people in an attempt to find out exactly what the
expected semantics are for cpu_kill/cpu_die; is it legal to call
cpu_kill without having first caused cpu_die to execute?
Will Deacon June 7, 2013, 6:18 p.m. UTC | #2
On Fri, Jun 07, 2013 at 05:44:33PM +0100, Stephen Warren wrote:
> On 06/07/2013 03:36 AM, Joseph Lo wrote:
> > The normal CPU hotplug flow in kernel and the flow for Tegra we expected,
> > is checking the CPU ID is OK for hotplug by "tegra_cpu_disable", the CPU
> > that would be hotplugged runs into a power-gate state by "tegra_cpu_die",
> > then the other CPU waits for the CPU that was hotplugged in reset and
> > clock gate it by "tegra_cpu_kill". That means we don't support the CPU
> > being stopped or put into offline by trigger "tegra_cpu_kill" directly.
> > It may cause a busy loop for waiting CPU in reset.
> > 
> > After the commit "62e930e reboot: rigrate shutdown/reboot to boot cpu",
> > we remove "disable_nonboot_cpus" when kernel_{restart,halt,power_off}.
> > But the ARM kernel trigger "send_smp_stop" when machine_shutdown, that
> > would cause the "tegra_cpu_kill" directly without "tegra_cpu_die" first.
> > 
> > We hook "disable_nonboot_cpus" in "reboot_notifier" to avoid that happens.
> > And it can work for reboot, shutdown, halt and kexec.
> 
> I don't believe this is the correct solution.
> 
> If the semantics of cpu_kill/cpu_die are such that it's legal to call
> only cpu_kill without having cause cpu_die to run on the killed CPU
> first, then Tegra's implementation is buggy. We should simply fix that,
> rather than avoiding this by forcing a different order for the calls to
> cpu_kill/cpu_die.
> 
> If the semantics of cpu_kill/cpu_die are such that one /must/ cause
> cpu_die to run on the killed CPU before cpu_kill can be used on it, then
> there's a bug in the code that isn't doing that.
> 
> I'm CCing a few people in an attempt to find out exactly what the
> expected semantics are for cpu_kill/cpu_die; is it legal to call
> cpu_kill without having first caused cpu_die to execute?

By cpu_kill, do you mean platform_cpu_kill called from __cpu_die? If so,
__cpu_die and cpu_die are definitely supposed to be treated as a pair, since
they synchronise via the cpu_died completion.

Will
Stephen Warren June 7, 2013, 6:56 p.m. UTC | #3
On 06/07/2013 12:18 PM, Will Deacon wrote:
> On Fri, Jun 07, 2013 at 05:44:33PM +0100, Stephen Warren wrote:
>> On 06/07/2013 03:36 AM, Joseph Lo wrote:
>>> The normal CPU hotplug flow in kernel and the flow for Tegra we expected,
>>> is checking the CPU ID is OK for hotplug by "tegra_cpu_disable", the CPU
>>> that would be hotplugged runs into a power-gate state by "tegra_cpu_die",
>>> then the other CPU waits for the CPU that was hotplugged in reset and
>>> clock gate it by "tegra_cpu_kill". That means we don't support the CPU
>>> being stopped or put into offline by trigger "tegra_cpu_kill" directly.
>>> It may cause a busy loop for waiting CPU in reset.
>>>
>>> After the commit "62e930e reboot: rigrate shutdown/reboot to boot cpu",
>>> we remove "disable_nonboot_cpus" when kernel_{restart,halt,power_off}.
>>> But the ARM kernel trigger "send_smp_stop" when machine_shutdown, that
>>> would cause the "tegra_cpu_kill" directly without "tegra_cpu_die" first.
>>>
>>> We hook "disable_nonboot_cpus" in "reboot_notifier" to avoid that happens.
>>> And it can work for reboot, shutdown, halt and kexec.
>>
>> I don't believe this is the correct solution.
>>
>> If the semantics of cpu_kill/cpu_die are such that it's legal to call
>> only cpu_kill without having cause cpu_die to run on the killed CPU
>> first, then Tegra's implementation is buggy. We should simply fix that,
>> rather than avoiding this by forcing a different order for the calls to
>> cpu_kill/cpu_die.
>>
>> If the semantics of cpu_kill/cpu_die are such that one /must/ cause
>> cpu_die to run on the killed CPU before cpu_kill can be used on it, then
>> there's a bug in the code that isn't doing that.
>>
>> I'm CCing a few people in an attempt to find out exactly what the
>> expected semantics are for cpu_kill/cpu_die; is it legal to call
>> cpu_kill without having first caused cpu_die to execute?
> 
> By cpu_kill, do you mean platform_cpu_kill called from __cpu_die?

The struct smp_operations .cpu_kill/.cpu_die hooks. So, yes.

> If so,
> __cpu_die and cpu_die are definitely supposed to be treated as a pair, since
> they synchronise via the cpu_died completion.

So the analysis I did, cribbed from our internal bug report so hopefully
it makes sense without any context there, was:

==========
Before that patch (62e930e reboot: "rigrate shutdown/reboot to boot
cpu"), kernel/sys.c:kernel_restart() and kernel_power_off() used to use
CPU hotplug mechanisms to unplug every CPU other than one CPU, then do
the reboot or shutdown. The ARM implementation of machine_restart() and
machine_power_off() both call machine_shutdown() which calls
smp_send_stop(), which IPIs to every CPU to tell it to stop. However,
since all the CPUs were unplugged, this was a no-op.

With the patch, the kernel simply disables scheduling on all CPUs except
logical CPU 0 in kernel_restart() and kernel_power_off(). This
guarantees that the code is running on logical CPU 0, but leaves the
other CPUs still present. Hence, the call to smp_send_stop() from the
ARM code is no longer a no-op. This code hangs.

The implementation of smp_send_stop() raises IPI_CPU_STOP on each CPU
(other than logical CPU 0). This eventually calls down to
tegra_cpu_kill()[1], which calls tegra_wait_cpu_in_reset() which calls
tegra20_wait_cpu_in_reset(). That hangs, because nothing has ever told
the flow controller to put the CPU in reset, so logical CPU 0 waits
indefinitely for this to happen, which is the hang.
==========

[1] Perhaps the issue is why ipi_send_stop() calls down into
tegra_cpu_kill() rather than tegra_cpu_die(), since die() is what should
be run on the killed CPU, and kill() on the killing CPU?
Stephen Warren June 7, 2013, 9:28 p.m. UTC | #4
On 06/07/2013 12:56 PM, Stephen Warren wrote:
...
> [1] Perhaps the issue is why ipi_send_stop() calls down into 
> tegra_cpu_kill() rather than tegra_cpu_die(), since die() is what
> should be run on the killed CPU, and kill() on the killing CPU?

Scratch that; I don't think it's calling down to /either/; I was
confused. It seems like it /should/ call cpu_die() though, at least if
hotplug is enabled, right?
Russell King - ARM Linux June 7, 2013, 10:15 p.m. UTC | #5
On Fri, Jun 07, 2013 at 03:28:38PM -0600, Stephen Warren wrote:
> On 06/07/2013 12:56 PM, Stephen Warren wrote:
> ...
> > [1] Perhaps the issue is why ipi_send_stop() calls down into 
> > tegra_cpu_kill() rather than tegra_cpu_die(), since die() is what
> > should be run on the killed CPU, and kill() on the killing CPU?
> 
> Scratch that; I don't think it's calling down to /either/; I was
> confused. It seems like it /should/ call cpu_die() though, at least if
> hotplug is enabled, right?

The problem is really complex.

CPU hotplug is done in paths where we're relatively confident that the
system is working correctly.  So all the features such as scheduling
are available, the timer ticks work and so forth.

However, reboot is a totally different environment.  This can happen
from almost any context with the system in any state what so ever.  A
CPU could be stuck.  A CPU could have oopsed.  The CPU which is in
the reboot code could be the CPU which has oopsed.  It could be called
from within an interrupt...

What that means is the usual CPU hotplug methods can't be used in the
reboot path.  Well, they can, but it will be fragile.

For reboot, the real solution there is not to use software-based
reboot, but bring the other cores to a halt (which is what
ipi_send_stop is doing) and then issue a hardware reset to the whole
system, including the other CPUs.
Stephen Warren June 7, 2013, 10:39 p.m. UTC | #6
On 06/07/2013 04:15 PM, Russell King - ARM Linux wrote:
> On Fri, Jun 07, 2013 at 03:28:38PM -0600, Stephen Warren wrote:
>> On 06/07/2013 12:56 PM, Stephen Warren wrote:
>> ...
>>> [1] Perhaps the issue is why ipi_send_stop() calls down into 
>>> tegra_cpu_kill() rather than tegra_cpu_die(), since die() is what
>>> should be run on the killed CPU, and kill() on the killing CPU?
>>
>> Scratch that; I don't think it's calling down to /either/; I was
>> confused. It seems like it /should/ call cpu_die() though, at least if
>> hotplug is enabled, right?
> 
> The problem is really complex.
> 
> CPU hotplug is done in paths where we're relatively confident that the
> system is working correctly.  So all the features such as scheduling
> are available, the timer ticks work and so forth.
> 
> However, reboot is a totally different environment.  This can happen
> from almost any context with the system in any state what so ever.  A
> CPU could be stuck.  A CPU could have oopsed.  The CPU which is in
> the reboot code could be the CPU which has oopsed.  It could be called
> from within an interrupt...
> 
> What that means is the usual CPU hotplug methods can't be used in the
> reboot path.  Well, they can, but it will be fragile.
> 
> For reboot, the real solution there is not to use software-based
> reboot, but bring the other cores to a halt (which is what
> ipi_send_stop is doing) and then issue a hardware reset to the whole
> system, including the other CPUs.

Ignoring the issues with oops in reboot, I think there's a bug in that
when hotplug is enabled, smp_kill_cpus() calls platform_cpu_kill(), but
nothing causes the failing CPU to ever execute smp_ops.cpu_die(). Hence,
if the implementation of smp_ops.cpu_kill() relies on the target CPU
having run smp_ops.cpu_die(), then smp_ops.cpu_kill() may not operate
correctly.

Or, must smp_ops.cpu_kill() not assume that smp_ops.cpu_die() will be
called on the target CPU? What are the semantics here? Will mentioned
that __cpu_die and cpu_die are a pair, but what about is the smp_ops are
used directly; are they also supposed to be a pair?

The change below solves the pairing issue, by making ipi_cpu_stop()
perform the low-level part of hotplug that matches what smp_kill_cpus()
call to platform_cpu_kill(). This certainly fixes the
hang-in-reboot-or-shutdown problem on Tegra.

> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 550d63c..541f667 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -581,11 +581,20 @@ static void ipi_cpu_stop(unsigned int cpu)
>  
>         set_cpu_online(cpu, false);
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> +#if 0
> +       arch_cpu_idle_dead();
> +#else
> +       /* The body of arch_cpu_idle_dead() - which is better? */
> +       cpu_die();
> +#endif
> +#else
>         local_fiq_disable();
>         local_irq_disable();
>  
>         while (1)
>                 cpu_relax();
> +#endif
>  }

Some things I'm not sure of here:

* cpu_die() calls idle_task_exit(). That's probably wrong if it's
triggered from an IPI; who knows what task it's executing. That said, if
migrate_to_reboot_cpu() did set_cpus_allowed_ptr(current,
cpumask_of(cpu)), perhaps that guarantees the CPU is running the idle
task since there's nothing else that could be running?

* ipi_cpu_stop() currently calls local_fiq_disable(), but cpu_die()
doesn't. Should both functions call both local_fiq_disable() and
local_irq_disable()?

* Perhaps smp_kill_cpus() should also be changed, to call cpu_die() not
platform_cpu_kill(), to keep the pairing correct at that level too.

Plus, I ignored any issues you raised for the oops case on reboot...
Russell King - ARM Linux June 7, 2013, 10:55 p.m. UTC | #7
On Fri, Jun 07, 2013 at 04:39:32PM -0600, Stephen Warren wrote:
> On 06/07/2013 04:15 PM, Russell King - ARM Linux wrote:
> > For reboot, the real solution there is not to use software-based
> > reboot, but bring the other cores to a halt (which is what
> > ipi_send_stop is doing) and then issue a hardware reset to the whole
> > system, including the other CPUs.
> 
> Ignoring the issues with oops in reboot, I think there's a bug in that
> when hotplug is enabled, smp_kill_cpus() calls platform_cpu_kill(), but
> nothing causes the failing CPU to ever execute smp_ops.cpu_die(). Hence,
> if the implementation of smp_ops.cpu_kill() relies on the target CPU
> having run smp_ops.cpu_die(), then smp_ops.cpu_kill() may not operate
> correctly.

Well, smp_kill_cpus() was added to get around the kexec problem -
transitioning from one kernel to the next kernel without going through
a hardware reset.  Maybe if we take a step back...

1. remove smp_kill_cpus() from smp_send_stop().
2. remove machine_shutdown() from machine_halt(), machine_power_off()
   and machine_restart().
3. call smp_send_stop() only from machine_halt(), machine_power_off() and
   machine_restart()
4. require a hardware-based reboot method for all SMP implementations;
   using soft_reboot() is not an option.

This should get us into the situation where we have a reliable method of
halting and rebooting the kernel everywhere, leaving kexec as being the
remaining problem case.

Currently, for that we effectively do smp_send_stop() followed by
smp_kill_cpus().  The no-op change for kexec there is to allow
smp_kill_cpus() to be called directly from machine_shutdown() - but
I suspect there will still be stuff that's broken with that...

So the ongoing problem remains - how to deal with kexec in a SMP
environment where it's difficult to reliably take a secondary CPU
offline to a safe place and then be able to restart it into the
next kernel...
Will Deacon June 10, 2013, 2:42 p.m. UTC | #8
On Fri, Jun 07, 2013 at 11:55:12PM +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 07, 2013 at 04:39:32PM -0600, Stephen Warren wrote:
> > On 06/07/2013 04:15 PM, Russell King - ARM Linux wrote:
> > > For reboot, the real solution there is not to use software-based
> > > reboot, but bring the other cores to a halt (which is what
> > > ipi_send_stop is doing) and then issue a hardware reset to the whole
> > > system, including the other CPUs.
> > 
> > Ignoring the issues with oops in reboot, I think there's a bug in that
> > when hotplug is enabled, smp_kill_cpus() calls platform_cpu_kill(), but
> > nothing causes the failing CPU to ever execute smp_ops.cpu_die(). Hence,
> > if the implementation of smp_ops.cpu_kill() relies on the target CPU
> > having run smp_ops.cpu_die(), then smp_ops.cpu_kill() may not operate
> > correctly.
> 
> Well, smp_kill_cpus() was added to get around the kexec problem -
> transitioning from one kernel to the next kernel without going through
> a hardware reset.  Maybe if we take a step back...
> 
> 1. remove smp_kill_cpus() from smp_send_stop().
> 2. remove machine_shutdown() from machine_halt(), machine_power_off()
>    and machine_restart().
> 3. call smp_send_stop() only from machine_halt(), machine_power_off() and
>    machine_restart()
> 4. require a hardware-based reboot method for all SMP implementations;
>    using soft_reboot() is not an option.
> 
> This should get us into the situation where we have a reliable method of
> halting and rebooting the kernel everywhere, leaving kexec as being the
> remaining problem case.
> 
> Currently, for that we effectively do smp_send_stop() followed by
> smp_kill_cpus().  The no-op change for kexec there is to allow
> smp_kill_cpus() to be called directly from machine_shutdown() - but
> I suspect there will still be stuff that's broken with that...
> 
> So the ongoing problem remains - how to deal with kexec in a SMP
> environment where it's difficult to reliably take a secondary CPU
> offline to a safe place and then be able to restart it into the
> next kernel...

For kexec, I think it's perfectly reasonable to mandate hardware-based
offlining for the secondary cores (hence the half-hearted dependency on
HOTPLUG_CPU). In that case, the only guy that has to go down the soft
reboot path is the primary CPU which shouldn't be too problematic, right?

Supporting sort-reboot of secondaries is a total PITA, even if you have some
`safe place' to put them. You still have to synchronise with non-coherent
cores so that you know when it's safe to clobber the old image, which
requires complex locking algorithms and a prevailing wind.

Will
diff mbox

Patch

diff --git a/arch/arm/mach-tegra/hotplug.c b/arch/arm/mach-tegra/hotplug.c
index a52c10e..1676669 100644
--- a/arch/arm/mach-tegra/hotplug.c
+++ b/arch/arm/mach-tegra/hotplug.c
@@ -10,6 +10,8 @@ 
 #include <linux/kernel.h>
 #include <linux/smp.h>
 #include <linux/clk/tegra.h>
+#include <linux/cpu.h>
+#include <linux/reboot.h>
 
 #include <asm/smp_plat.h>
 
@@ -57,6 +59,21 @@  int tegra_cpu_disable(unsigned int cpu)
 	}
 }
 
+/*
+ * reboot notifier for avoid busy loop when trigger tegra_cpu_kill
+ * without tegra_cpu_die first
+ */
+static int tegra_cpu_kill_notify(struct notifier_block *nb,
+				 unsigned long action, void *data)
+{
+	disable_nonboot_cpus();
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tegra_cpu_kill_nb = {
+	.notifier_call = tegra_cpu_kill_notify,
+};
+
 void __init tegra_hotplug_init(void)
 {
 	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
@@ -68,4 +85,6 @@  void __init tegra_hotplug_init(void)
 		tegra_hotplug_shutdown = tegra30_hotplug_shutdown;
 	if (IS_ENABLED(CONFIG_ARCH_TEGRA_114_SOC) && tegra_chip_id == TEGRA114)
 		tegra_hotplug_shutdown = tegra30_hotplug_shutdown;
+
+	register_reboot_notifier(&tegra_cpu_kill_nb);
 }