Linux crashes when trying to online secondary core


Message ID

alpine.DEB.2.20.1612141807410.3556@nanos (mailing list archive)

State

New, archived

Headers

Date: Wed, 14 Dec 2016 18:08:14 +0100 (CET)
From: Thomas Gleixner <tglx@linutronix.de>
To: Mason <slash.tmp@free.fr>
Subject: Re: Linux crashes when trying to online secondary core
In-Reply-To: <ef972981-3fb6-74a4-cd83-a6629d2dab2a@free.fr>
Message-ID: <alpine.DEB.2.20.1612141807410.3556@nanos>
References: <ef972981-3fb6-74a4-cd83-a6629d2dab2a@free.fr>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Precedence: list
Cc: Mark Rutland <mark.rutland@arm.com>,
	Richard Cochran <rcochran@linutronix.de>,
	Thibaud Cornic <thibaud_cornic@sigmadesigns.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Sebastian Frias <sf84@laposte.net>, 
	Anna-Maria Gleixner <anna-maria@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>, 
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

On Wed, 14 Dec 2016, Mason wrote:

> Hello,
>
> I'm seeing Linux v4.9 crash (dereferencing NULL) when I try to online
> the secondary core, after putting it offline.

Does the patch below fix the issue?

Thanks,

	tglx

8<---------------

Comments

Mason Dec. 14, 2016, 5:47 p.m. UTC | #1

On 14/12/2016 18:08, Thomas Gleixner wrote:

> On Wed, 14 Dec 2016, Mason wrote:
> 
>> I'm seeing Linux v4.9 crash (dereferencing NULL) when I try to online
>> the secondary core, after putting it offline.
> 
> Does the patch below fix the issue?
> 
> Thanks,
> 
> 	tglx
> 	
> 8<---------------
> 
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 22acee76cf4c..2594c287b078 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -101,7 +101,6 @@ enum cpuhp_state {
>  	CPUHP_AP_ARM_L2X0_STARTING,
>  	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
>  	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
> -	CPUHP_AP_DUMMY_TIMER_STARTING,
>  	CPUHP_AP_JCORE_TIMER_STARTING,
>  	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
>  	CPUHP_AP_ARM_TWD_STARTING,
> @@ -111,6 +110,7 @@ enum cpuhp_state {
>  	CPUHP_AP_MARCO_TIMER_STARTING,
>  	CPUHP_AP_MIPS_GIC_TIMER_STARTING,
>  	CPUHP_AP_ARC_TIMER_STARTING,
> +	CPUHP_AP_DUMMY_TIMER_STARTING,
>  	CPUHP_AP_KVM_STARTING,
>  	CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
>  	CPUHP_AP_KVM_ARM_VGIC_STARTING,

$ patch -p1 < tglx.patch 
patching file include/linux/cpuhotplug.h
Hunk #1 succeeded at 80 (offset -21 lines).
Hunk #2 succeeded at 89 (offset -21 lines).

It does seem to fix the problem:

# echo 0 > /sys/devices/system/cpu/cpu1/online
SMC called with a0=0x00[000001 a1=0x00000121 a2=0x00000005  a3 =0xc01189b4  0x00000121
[1][flow/suspend3.c:39] CPU 1 die: jumping6 to. post-boot WFE
402826] CPU1: shutdown
SMC called with a0=0x00000001 a1=0x00000122 a2=0x00000000 a3=0x00000000 0x00000122
[0][flow/suspend.c:82] Killing core1
armor+++ armor: core 1 booted, entering wfe...
# echo 1 > /sys/devices/system/cpu/cpu1/online
[  215.692700] tango_boot_secondary from __cpu_up
SMC called with a0=0x80101500 a1=0x00000105 a2=0x00000000 a3=0x00000000 0x00000105
[  215.704494] tango_set_aux_boot_addr=0
SMC called with a0=0x00000001 a1=0x00000104 a2=0x00000000 a3=0x00000000 0x00000104
[0][flow/smc_handler.c:127] waking up CPU1
[  215.719308] tango_start_aux_core=0


I reverted your patch, and the kernel blows up again.

So what's the problem, and how does your patch solve it?

Regards.

Mason Dec. 15, 2016, 10:35 a.m. UTC | #2

On 14/12/2016 18:47, Mason wrote:

> On 14/12/2016 18:08, Thomas Gleixner wrote:
> 
>> On Wed, 14 Dec 2016, Mason wrote:
>>
>>> I'm seeing Linux v4.9 crash (dereferencing NULL) when I try to online
>>> the secondary core, after putting it offline.
>>
>> Does the patch below fix the issue?
>>
>> Thanks,
>>
>> 	tglx
>> 	
>> 8<---------------
>>
>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>> index 22acee76cf4c..2594c287b078 100644
>> --- a/include/linux/cpuhotplug.h
>> +++ b/include/linux/cpuhotplug.h
>> @@ -101,7 +101,6 @@ enum cpuhp_state {
>>  	CPUHP_AP_ARM_L2X0_STARTING,
>>  	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
>>  	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
>> -	CPUHP_AP_DUMMY_TIMER_STARTING,
>>  	CPUHP_AP_JCORE_TIMER_STARTING,
>>  	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
>>  	CPUHP_AP_ARM_TWD_STARTING,
>> @@ -111,6 +110,7 @@ enum cpuhp_state {
>>  	CPUHP_AP_MARCO_TIMER_STARTING,
>>  	CPUHP_AP_MIPS_GIC_TIMER_STARTING,
>>  	CPUHP_AP_ARC_TIMER_STARTING,
>> +	CPUHP_AP_DUMMY_TIMER_STARTING,
>>  	CPUHP_AP_KVM_STARTING,
>>  	CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
>>  	CPUHP_AP_KVM_ARM_VGIC_STARTING,
> 
> $ patch -p1 < tglx.patch 
> patching file include/linux/cpuhotplug.h
> Hunk #1 succeeded at 80 (offset -21 lines).
> Hunk #2 succeeded at 89 (offset -21 lines).
> 
> It does seem to fix the problem:
> 
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> SMC called with a0=0x00[000001 a1=0x00000121 a2=0x00000005  a3 =0xc01189b4  0x00000121
> [1][flow/suspend3.c:39] CPU 1 die: jumping6 to. post-boot WFE
> 402826] CPU1: shutdown
> SMC called with a0=0x00000001 a1=0x00000122 a2=0x00000000 a3=0x00000000 0x00000122
> [0][flow/suspend.c:82] Killing core1
> armor+++ armor: core 1 booted, entering wfe...
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> [  215.692700] tango_boot_secondary from __cpu_up
> SMC called with a0=0x80101500 a1=0x00000105 a2=0x00000000 a3=0x00000000 0x00000105
> [  215.704494] tango_set_aux_boot_addr=0
> SMC called with a0=0x00000001 a1=0x00000104 a2=0x00000000 a3=0x00000000 0x00000104
> [0][flow/smc_handler.c:127] waking up CPU1
> [  215.719308] tango_start_aux_core=0
> 
> 
> I reverted your patch, and the kernel blows up again.
> 
> So what's the problem, and how does your patch solve it?

Link to the original report:
https://marc.info/?l=linux-arm-kernel&m=148173152524746&w=2

Forgot to CC Robin Murphy, who had provided valuable input
in similar circumstances a few months back.

Also add LKML, since this doesn't appear to be ARM-specific.

Do I need to specify which device tree I was using?

Regards.

Mark Rutland Dec. 15, 2016, noon UTC | #3

On Thu, Dec 15, 2016 at 11:35:12AM +0100, Mason wrote:
> On 14/12/2016 18:47, Mason wrote:
> > On 14/12/2016 18:08, Thomas Gleixner wrote:
> >> Does the patch below fix the issue?

> >> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> >> index 22acee76cf4c..2594c287b078 100644
> >> --- a/include/linux/cpuhotplug.h
> >> +++ b/include/linux/cpuhotplug.h
> >> @@ -101,7 +101,6 @@ enum cpuhp_state {
> >>  	CPUHP_AP_ARM_L2X0_STARTING,
> >>  	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
> >>  	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
> >> -	CPUHP_AP_DUMMY_TIMER_STARTING,
> >>  	CPUHP_AP_JCORE_TIMER_STARTING,
> >>  	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
> >>  	CPUHP_AP_ARM_TWD_STARTING,
> >> @@ -111,6 +110,7 @@ enum cpuhp_state {
> >>  	CPUHP_AP_MARCO_TIMER_STARTING,
> >>  	CPUHP_AP_MIPS_GIC_TIMER_STARTING,
> >>  	CPUHP_AP_ARC_TIMER_STARTING,
> >> +	CPUHP_AP_DUMMY_TIMER_STARTING,
> >>  	CPUHP_AP_KVM_STARTING,
> >>  	CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
> >>  	CPUHP_AP_KVM_ARM_VGIC_STARTING,

> > It does seem to fix the problem:

> > I reverted your patch, and the kernel blows up again.
> > 
> > So what's the problem, and how does your patch solve it?
> 
> Link to the original report:
> https://marc.info/?l=linux-arm-kernel&m=148173152524746&w=2
> 
> Forgot to CC Robin Murphy, who had provided valuable input
> in similar circumstances a few months back.
> 
> Also add LKML, since this doesn't appear to be ARM-specific.
> 
> Do I need to specify which device tree I was using?

This is already fixed in the linux-tip tree, with commit messages
describing the fix.

It's specific to a few clocksources, due to their hotplug callbacks
occurring later than the dummy timer. That triggers the bug fixed in:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=c1a9eeb938b5433947e5ea22f89baff3182e7075

The relevant timers were fixed in:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=smp/urgent&id=9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9

Thanks,
Mark.

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22acee76cf4c..2594c287b078 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -101,7 +101,6 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_L2X0_STARTING,
 	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
 	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
-	CPUHP_AP_DUMMY_TIMER_STARTING,
 	CPUHP_AP_JCORE_TIMER_STARTING,
 	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
 	CPUHP_AP_ARM_TWD_STARTING,
@@ -111,6 +110,7 @@ enum cpuhp_state {
 	CPUHP_AP_MARCO_TIMER_STARTING,
 	CPUHP_AP_MIPS_GIC_TIMER_STARTING,
 	CPUHP_AP_ARC_TIMER_STARTING,
+	CPUHP_AP_DUMMY_TIMER_STARTING,
 	CPUHP_AP_KVM_STARTING,
 	CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
 	CPUHP_AP_KVM_ARM_VGIC_STARTING,
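
Why moving one enum entry changes behaviour: the STARTING callbacks for a
CPU coming online run in ascending enum cpuhp_state order, so the position
of CPUHP_AP_DUMMY_TIMER_STARTING decides whether the dummy timer callback
runs before or after a given real timer callback. The user-space sketch
below is only a toy model of that ordering, not kernel code. Every toy_*
name is hypothetical, and it assumes a platform whose only real per-CPU
timer was listed after the dummy timer in the old enum (for example
CPUHP_AP_ARM_TWD_STARTING).

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy per-CPU bookkeeping. Hypothetical, not a kernel structure. */
struct toy_cpu {
    bool real_timer_ready;  /* a real per-CPU timer callback has run */
    bool dummy_ran_early;   /* dummy callback ran with no real timer yet */
};

typedef void (*toy_callback)(struct toy_cpu *cpu);

/* Stands in for a real timer's STARTING callback (e.g. the TWD one). */
static void toy_real_timer_starting(struct toy_cpu *cpu)
{
    cpu->real_timer_ready = true;
}

/* Stands in for the dummy timer's STARTING callback. */
static void toy_dummy_timer_starting(struct toy_cpu *cpu)
{
    if (!cpu->real_timer_ready)
        cpu->dummy_ran_early = true;
}

/* Old enum order: the dummy timer precedes this platform's real timer. */
static const toy_callback toy_order_before[] = {
    toy_dummy_timer_starting,
    toy_real_timer_starting,
};

/* Patched enum order: the dummy timer follows the real timers. */
static const toy_callback toy_order_after[] = {
    toy_real_timer_starting,
    toy_dummy_timer_starting,
};

/* Walk the callbacks in table order, the way ascending states would run. */
struct toy_cpu toy_bring_up(const toy_callback *steps, int nr_steps)
{
    struct toy_cpu cpu = { false, false };

    for (int i = 0; i < nr_steps; i++)
        steps[i](&cpu);
    return cpu;
}

/* Print which side of the real timer the dummy callback landed on. */
void toy_report(const char *label, const toy_callback *steps, int nr_steps)
{
    struct toy_cpu cpu = toy_bring_up(steps, nr_steps);

    printf("%s: dummy timer ran %s the real timer\n", label,
           cpu.dummy_ran_early ? "before" : "after");
}
```

Calling toy_report("before patch", toy_order_before, 2) and
toy_report("after patch", toy_order_after, 2) from a main() shows the
difference. In the first case the dummy callback runs while no real timer
exists for the CPU yet, which is the window Mark describes as triggering
the bug. The model says nothing about the NULL dereference itself, which
was fixed separately in the timers/urgent commit linked above.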
