
[v4,4/7] arm64: Handle early CPU boot failures

Message ID 1453745225-27736-5-git-send-email-suzuki.poulose@arm.com (mailing list archive)
State New, archived

Commit Message

Suzuki K Poulose Jan. 25, 2016, 6:07 p.m. UTC
From: Suzuki K. Poulose <suzuki.poulose@arm.com>

A secondary CPU could fail to come online due to insufficient
capabilities and could simply die or loop in the kernel. For example,
a CPU with no support for the selected kernel PAGE_SIZE loops in the
kernel with the MMU turned off, and a hotplugged CPU which doesn't
have one of the advertised system capabilities will die during
activation.

There is no way to synchronise the status of the failing CPU back to
the master. This patch solves the issue by adding a field to
secondary_data which can be updated by the failing CPU. If the
secondary CPU fails even before turning the MMU on, it updates the
status in a special variable reserved in the .head.text section, to
make sure that the update can be cache-invalidated safely without
possibly sharing a cache writeback granule with other data.

Here are the possible states :

 -1. CPU_MMU_OFF - Initial value set by the master CPU. This value
indicates that the CPU could not turn the MMU on, hence the status
could not be reliably updated in secondary_data. Instead, the CPU
has updated the status in __early_cpu_boot_status (reserved in the
.head.text section).

 0. CPU_BOOT_SUCCESS - CPU has booted successfully.

 1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating to the
master CPU that it should synchronise by issuing cpu_ops->cpu_kill.

 2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die() and is instead
looping in the kernel. This information could be used by, say, kexec
to check whether it is really safe to do a kexec reboot.

 3. CPU_PANIC_KERNEL - CPU detected a serious issue which requires
the kernel to crash immediately. The secondary CPU cannot call
panic() until it has initialised the GIC, so this flag can be used
to instruct the master to do so.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
---
 arch/arm64/include/asm/smp.h    |   26 ++++++++++++++++++++++
 arch/arm64/kernel/asm-offsets.c |    2 ++
 arch/arm64/kernel/head.S        |   39 +++++++++++++++++++++++++++++---
 arch/arm64/kernel/smp.c         |   47 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 111 insertions(+), 3 deletions(-)

Comments

Catalin Marinas Feb. 3, 2016, 12:57 p.m. UTC | #1
Hi Suzuki,

On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K. Poulose wrote:
> +/* Values for secondary_data.status */
> +
> +#define CPU_MMU_OFF		-1
> +#define CPU_BOOT_SUCCESS	0
> +/* The cpu invoked ops->cpu_die, synchronise it with cpu_kill */
> +#define CPU_KILL_ME		1
> +/* The cpu couldn't die gracefully and is looping in the kernel */
> +#define CPU_STUCK_IN_KERNEL	2
> +/* Fatal system error detected by secondary CPU, crash the system */
> +#define CPU_PANIC_KERNEL	3

Please add braces around these numbers, just in case (I added them
locally).

>  /*
> + * The booting CPU updates the failed status, with MMU turned off,
> + * below which lies in head.txt to make sure it doesn't share the same writeback
> + * granule. So that we can invalidate it properly.

I can't really parse this (it looks like punctuation in the wrong place;
also "share the same..." with what?).

> + *
> + * update_early_cpu_boot_status tmp, status
> + *  - Corrupts tmp, x0, x1
> + *  - Writes 'status' to __early_cpu_boot_status and makes sure
> + *    it is committed to memory.
> + */
> +
> +	.macro	update_early_cpu_boot_status tmp, status
> +	mov	\tmp, lr
> +	adrp	x0, __early_cpu_boot_status
> +	add	x0, x0, #:lo12:__early_cpu_boot_status

Nitpick: you could use the adr_l macro.

> +	mov	x1, #\status
> +	str	x1, [x0]
> +	add	x1, x0, 4
> +	bl	__inval_cache_range
> +	mov	lr, \tmp
> +	.endm

If the CPU that's currently booting has the MMU off, what's the point of
invalidating the cache here? The operation may not even be broadcast to
the other CPU. So you actually need the invalidation before reading the
status on the primary CPU.

> +
> +ENTRY(__early_cpu_boot_status)
> +	.long 	0
> +END(__early_cpu_boot_status)

I think we should just do like __boot_cpu_mode and place it in the
.data..cacheline_aligned section. You can always use the safe
clean+invalidate before reading the value so that we don't care much
about the write-back granule.

> @@ -89,12 +101,14 @@ static DECLARE_COMPLETION(cpu_running);
>  int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  {
>  	int ret;
> +	int status;
>  
>  	/*
>  	 * We need to tell the secondary core where to find its stack and the
>  	 * page tables.
>  	 */
>  	secondary_data.stack = task_stack_page(idle) + THREAD_START_SP;
> +	update_cpu_boot_status(CPU_MMU_OFF);
>  	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
>  
>  	/*
> @@ -117,7 +131,35 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  		pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
>  	}
>  
> +	/* Make sure the update to status is visible */
> +	smp_rmb();

Which status? In relation to what?

>  	secondary_data.stack = NULL;
> +	status = READ_ONCE(secondary_data.status);
> +	if (ret && status) {
> +
> +		if (status == CPU_MMU_OFF)
> +			status = READ_ONCE(__early_cpu_boot_status);

You need cache maintenance before reading this.

> +
> +		switch (status) {
> +		default:
> +			pr_err("CPU%u: failed in unknown state : 0x%x\n",
> +					cpu, status);
> +			break;
> +		case CPU_KILL_ME:
> +			if (!op_cpu_kill(cpu)) {
> +				pr_crit("CPU%u: died during early boot\n", cpu);
> +				break;
> +			}
> +			/* Fall through */
> +			pr_crit("CPU%u: may not have shut down cleanly\n", cpu);
> +		case CPU_STUCK_IN_KERNEL:
> +			pr_crit("CPU%u: is stuck in kernel\n", cpu);
> +			cpus_stuck_in_kernel++;
> +			break;
> +		case CPU_PANIC_KERNEL:
> +			panic("CPU%u detected unsupported configuration\n", cpu);
> +		}
> +	}
>  
>  	return ret;
>  }

BTW, you can send a fix-up on top of this series with corrections;
I can fold them in.
Mark Rutland Feb. 3, 2016, 4:46 p.m. UTC | #2
On Wed, Feb 03, 2016 at 12:57:38PM +0000, Catalin Marinas wrote:
> Hi Suzuki,
> 
> On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K. Poulose wrote:
> > + * update_early_cpu_boot_status tmp, status
> > + *  - Corrupts tmp, x0, x1
> > + *  - Writes 'status' to __early_cpu_boot_status and makes sure
> > + *    it is committed to memory.
> > + */
> > +
> > +	.macro	update_early_cpu_boot_status tmp, status
> > +	mov	\tmp, lr
> > +	adrp	x0, __early_cpu_boot_status
> > +	add	x0, x0, #:lo12:__early_cpu_boot_status
> 
> Nitpick: you could use the adr_l macro.
> 
> > +	mov	x1, #\status
> > +	str	x1, [x0]
> > +	add	x1, x0, 4
> > +	bl	__inval_cache_range
> > +	mov	lr, \tmp
> > +	.endm
> 
> If the CPU that's currently booting has the MMU off, what's the point of
> invalidating the cache here?

To invalidate stale lines for this address, held in any caches prior to
the PoC. I'm assuming that __early_cpu_boot_status is sufficiently
padded to the CWG.

Cache maintenance works when SCTLR_ELx.M == 0, though barriers are
required prior to cache maintenance as non-cacheable accesses do not
hazard by VA.

The MMU being off has no effect on the cache maintenance itself.

> The operation may not even be broadcast to the other CPU. So you
> actually need the invalidation before reading the status on the
> primary CPU.

We require that CPUs are coherent when they enter the kernel, so any
cache maintenance operation _must_ affect all coherent caches (i.e. it
must be broadcast and must affect all coherent caches prior to the PoC
in this case).

> > +
> > +ENTRY(__early_cpu_boot_status)
> > +	.long 	0
> > +END(__early_cpu_boot_status)
> 
> I think we should just do like __boot_cpu_mode and place it in the
> .data..cacheline_aligned section.

I think we should add a separate __writeback_aligned annotation for
stuff like this, even if it falls in .data..cacheline_aligned for now.

Otherwise, agreed.

> You can always use the safe clean+invalidate before reading the value
> so that we don't care much about the write-back granule.

To get correct data out we need to pad to the CWG regardless of whether
the reader or the writer performs the maintenance.

Mark.
Mark Rutland Feb. 3, 2016, 5:01 p.m. UTC | #3
On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K Poulose wrote:
> From: Suzuki K. Poulose <suzuki.poulose@arm.com>
> 
> A secondary CPU could fail to come online due to insufficient
> capabilities and could simply die or loop in the kernel.
> e.g, a CPU with no support for the selected kernel PAGE_SIZE
> loops in kernel with MMU turned off.
> or a hotplugged CPU which doesn't have one of the advertised
> system capability will die during the activation.
> 
> There is no way to synchronise the status of the failing CPU
> back to the master. This patch solves the issue by adding a
> field to the secondary_data which can be updated by the failing
> CPU. If the secondary CPU fails even before turning the MMU on,
> it updates the status in a special variable reserved in the head.txt
> section to make sure that the update can be cache invalidated safely
> without possible sharing of cache write back granule.
> 
> Here are the possible states :
> 
>  -1. CPU_MMU_OFF - Initial value set by the master CPU, this value
> indicates that the CPU could not turn the MMU on, hence the status
> could not be reliably updated in the secondary_data. Instead, the
> CPU has updated the status in __early_cpu_boot_status (reserved in
> head.txt section)
> 
>  0. CPU_BOOT_SUCCESS - CPU has booted successfully.
> 
>  1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating the
> master CPU to synchronise by issuing a cpu_ops->cpu_kill.
> 
>  2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die(), instead is
> looping in the kernel. This information could be used by say,
> kexec to check if it is really safe to do a kexec reboot.
> 
>  3. CPU_PANIC_KERNEL - CPU detected some serious issues which
> requires kernel to crash immediately. The secondary CPU cannot
> call panic() until it has initialised the GIC. This flag can
> be used to instruct the master to do so.

When would we use this last case?

Perhaps a better option is to always throw any incompatible CPU into an
(MMU-off) pen, and assume that it's stuck in the kernel, even if we
could theoretically turn it off.

A system with incompatible CPUs is a broken system to begin with...

Otherwise, comments below.

> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/smp.h    |   26 ++++++++++++++++++++++
>  arch/arm64/kernel/asm-offsets.c |    2 ++
>  arch/arm64/kernel/head.S        |   39 +++++++++++++++++++++++++++++---
>  arch/arm64/kernel/smp.c         |   47 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 3 deletions(-)

[...]

> @@ -650,6 +679,10 @@ __enable_mmu:
>  ENDPROC(__enable_mmu)
>  
>  __no_granule_support:
> +	/* Indicate that this CPU can't boot and is stuck in the kernel */
> +	update_early_cpu_boot_status x2, CPU_STUCK_IN_KERNEL
> +1:
>  	wfe
> -	b __no_granule_support
> +	wfi
> +	b 1b

The addition of wfi seems fine, but should be mentioned in the commit
message.

>  ENDPROC(__no_granule_support)
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 59f032b..d2721ae 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -63,6 +63,8 @@
>   * where to place its SVC stack
>   */
>  struct secondary_data secondary_data;
> +/* Number of CPUs which aren't online, but looping in kernel text. */
> +u32 cpus_stuck_in_kernel;

Why u32 rather than int?

>  
>  enum ipi_msg_type {
>  	IPI_RESCHEDULE,
> @@ -72,6 +74,16 @@ enum ipi_msg_type {
>  	IPI_IRQ_WORK,
>  };
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> +static int op_cpu_kill(unsigned int cpu);
> +#else
> +static inline int op_cpu_kill(unsigned int cpu)
> +{
> +	return -ENOSYS;
> +}
> +#endif

There is no !CONFIG_HOTPLUG_CPU configuration any more.

> +
> +
>  /*
>   * Boot a secondary CPU, and assign it the specified idle task.
>   * This also gives us the initial stack to use for this CPU.
> @@ -89,12 +101,14 @@ static DECLARE_COMPLETION(cpu_running);
>  int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  {
>  	int ret;
> +	int status;
>  
>  	/*
>  	 * We need to tell the secondary core where to find its stack and the
>  	 * page tables.
>  	 */
>  	secondary_data.stack = task_stack_page(idle) + THREAD_START_SP;
> +	update_cpu_boot_status(CPU_MMU_OFF);
>  	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
>  
>  	/*

> @@ -117,7 +131,35 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  		pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
>  	}
>  
> +	/* Make sure the update to status is visible */
> +	smp_rmb();
>  	secondary_data.stack = NULL;
> +	status = READ_ONCE(secondary_data.status);

What is the rmb intended to order here?

> +	if (ret && status) {
> +
> +		if (status == CPU_MMU_OFF)
> +			status = READ_ONCE(__early_cpu_boot_status);
> +
> +		switch (status) {
> +		default:
> +			pr_err("CPU%u: failed in unknown state : 0x%x\n",
> +					cpu, status);
> +			break;
> +		case CPU_KILL_ME:
> +			if (!op_cpu_kill(cpu)) {
> +				pr_crit("CPU%u: died during early boot\n", cpu);
> +				break;
> +			}
> +			/* Fall through */
> +			pr_crit("CPU%u: may not have shut down cleanly\n", cpu);
> +		case CPU_STUCK_IN_KERNEL:
> +			pr_crit("CPU%u: is stuck in kernel\n", cpu);
> +			cpus_stuck_in_kernel++;
> +			break;
> +		case CPU_PANIC_KERNEL:
> +			panic("CPU%u detected unsupported configuration\n", cpu);
> +		}
> +	}
>  
>  	return ret;
>  }
> @@ -185,6 +227,9 @@ asmlinkage void secondary_start_kernel(void)
>  	 */
>  	pr_info("CPU%u: Booted secondary processor [%08x]\n",
>  					 cpu, read_cpuid_id());
> +	update_cpu_boot_status(CPU_BOOT_SUCCESS);
> +	/* Make sure the status update is visible before we complete */
> +	smp_wmb();

Surely complete() has appropriate barriers?

>  	set_cpu_online(cpu, true);
>  	complete(&cpu_running);
>  
> @@ -327,10 +372,12 @@ void cpu_die_early(void)
>  	set_cpu_present(cpu, 0);
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> +	update_cpu_boot_status(CPU_KILL_ME);
>  	/* Check if we can park ourselves */
>  	if (cpu_ops[cpu] && cpu_ops[cpu]->cpu_die)
>  		cpu_ops[cpu]->cpu_die(cpu);

I think you need a barrier to ensure visibility of the store prior to
calling cpu_die (i.e. you want to order an access against execution). A
dsb is what you want -- smp_wmb() only expands to a dmb.

>  #endif
> +	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
>  
>  	cpu_park_loop();

Likewise here.

Mark.
Catalin Marinas Feb. 3, 2016, 5:15 p.m. UTC | #4
On Wed, Feb 03, 2016 at 05:01:15PM +0000, Mark Rutland wrote:
> On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K Poulose wrote:
> > From: Suzuki K. Poulose <suzuki.poulose@arm.com>
> > 
> > A secondary CPU could fail to come online due to insufficient
> > capabilities and could simply die or loop in the kernel.
> > e.g, a CPU with no support for the selected kernel PAGE_SIZE
> > loops in kernel with MMU turned off.
> > or a hotplugged CPU which doesn't have one of the advertised
> > system capability will die during the activation.
> > 
> > There is no way to synchronise the status of the failing CPU
> > back to the master. This patch solves the issue by adding a
> > field to the secondary_data which can be updated by the failing
> > CPU. If the secondary CPU fails even before turning the MMU on,
> > it updates the status in a special variable reserved in the head.txt
> > section to make sure that the update can be cache invalidated safely
> > without possible sharing of cache write back granule.
> > 
> > Here are the possible states :
> > 
> >  -1. CPU_MMU_OFF - Initial value set by the master CPU, this value
> > indicates that the CPU could not turn the MMU on, hence the status
> > could not be reliably updated in the secondary_data. Instead, the
> > CPU has updated the status in __early_cpu_boot_status (reserved in
> > head.txt section)
> > 
> >  0. CPU_BOOT_SUCCESS - CPU has booted successfully.
> > 
> >  1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating the
> > master CPU to synchronise by issuing a cpu_ops->cpu_kill.
> > 
> >  2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die(), instead is
> > looping in the kernel. This information could be used by say,
> > kexec to check if it is really safe to do a kexec reboot.
> > 
> >  3. CPU_PANIC_KERNEL - CPU detected some serious issues which
> > requires kernel to crash immediately. The secondary CPU cannot
> > call panic() until it has initialised the GIC. This flag can
> > be used to instruct the master to do so.
> 
> When would we use this last case?

It's used in a subsequent series when verifying the ASID bits. I haven't
followed the previous discussions but I guess Suzuki aims to panic the
whole kernel rather than just stop the current CPU when an incompatible
ASID size is found.
Suzuki K Poulose Feb. 3, 2016, 5:23 p.m. UTC | #5
Hi Catalin,

>> +/* Fatal system error detected by secondary CPU, crash the system */
>> +#define CPU_PANIC_KERNEL	3
>
> Please add braces around these numbers, just in case (I added them
> locally).

OK, I will base my changes on top of the corrections.

>
>>   /*
>> + * The booting CPU updates the failed status, with MMU turned off,
>> + * below which lies in head.txt to make sure it doesn't share the same writeback
>> + * granule. So that we can invalidate it properly.
>
> I can't really parse this (it looks like punctuation in the wrong place;
> also "share the same..." with what?).

Sorry it doesn't parse :(. It should have been something like:

"The booting CPU updates the failed status @__early_cpu_boot_status (with
the MMU turned off). It is placed in .head.text to make sure it doesn't
share the same writeback granule with anything else while the CPU
updates it."


>> +	.macro	update_early_cpu_boot_status tmp, status
>> +	mov	\tmp, lr
>> +	adrp	x0, __early_cpu_boot_status
>> +	add	x0, x0, #:lo12:__early_cpu_boot_status
>
> Nitpick: you could use the adr_l macro.

Yes, that looks much better, will keep in mind.


>> @@ -117,7 +131,35 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>>   		pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
>>   	}
>>
>> +	/* Make sure the update to status is visible */
>> +	smp_rmb();
>
> Which status? In relation to what?

It refers to secondary_data.status, which is updated by the successfully
booted CPUs (with the MMU turned on) before they signal cpu_running. But
as Mark pointed out, complete() implies a barrier, so we won't need it.

>
>>   	secondary_data.stack = NULL;
>> +	status = READ_ONCE(secondary_data.status);
>> +	if (ret && status) {
>> +
>> +		if (status == CPU_MMU_OFF)
>> +			status = READ_ONCE(__early_cpu_boot_status);
>
> You need cache maintenance before reading this.

OK.

Thanks
Suzuki
Suzuki K Poulose Feb. 3, 2016, 5:24 p.m. UTC | #6
On 03/02/16 17:01, Mark Rutland wrote:
> On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K Poulose wrote:
>> From: Suzuki K. Poulose <suzuki.poulose@arm.com>
>>

>>   3. CPU_PANIC_KERNEL - CPU detected some serious issues which
>> requires kernel to crash immediately. The secondary CPU cannot
>> call panic() until it has initialised the GIC. This flag can
>> be used to instruct the master to do so.
>
> When would we use this last case?

As of now, it is used when we have incompatible ASID bits.

>
> Perhaps a better option is to always throw any incompatible CPU into an
> (MMU-off) pen, and assume that it's stuck in the kernel, even if we
> could theoretically turn it off.

Right, that is another option. I am fine with either.

>> -	b __no_granule_support
>> +	wfi
>> +	b 1b
>
> The addition of wfi seems fine, but should be mentioned in the commit
> message.

Sure.

>>   struct secondary_data secondary_data;
>> +/* Number of CPUs which aren't online, but looping in kernel text. */
>> +u32 cpus_stuck_in_kernel;
>
> Why u32 rather than int?

No specific reason; since it is a quantity which cannot be < 0, I kept
it unsigned. It could be unsigned int.

>> +#ifdef CONFIG_HOTPLUG_CPU
>> +static int op_cpu_kill(unsigned int cpu);
>> +#else
>> +static inline int op_cpu_kill(unsigned int cpu)
>> +{
>> +	return -ENOSYS;
>> +}
>> +#endif
>
> There is no !CONFIG_HOTPLUG_CPU configuration any more.

That's what I thought, but then there was [1]. If you disable
CONFIG_PM_SLEEP, you can still build with !CONFIG_HOTPLUG_CPU (or, in
other words, allnoconfig).

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/384589.html

>>
>> +	/* Make sure the update to status is visible */
>> +	smp_rmb();
>>   	secondary_data.stack = NULL;
>> +	status = READ_ONCE(secondary_data.status);
>
> What is the rmb intended to order here?

It was for the complete(). But...

>> +	update_cpu_boot_status(CPU_BOOT_SUCCESS);
>> +	/* Make sure the status update is visible before we complete */
>> +	smp_wmb();
>
> Surely complete() has appropriate barriers?

Yes, it does. We can remove it.
  
>>   #ifdef CONFIG_HOTPLUG_CPU
>> +	update_cpu_boot_status(CPU_KILL_ME);
>>   	/* Check if we can park ourselves */
>>   	if (cpu_ops[cpu] && cpu_ops[cpu]->cpu_die)
>>   		cpu_ops[cpu]->cpu_die(cpu);
>
> I think you need a barrier to ensure visibility of the store prior to
> calling cpu_die (i.e. you want to order an access against execution). A
> dsb is what you want -- smp_wmb() only expands to a dmb.
>

OK.

>>   #endif
>> +	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
>>
>>   	cpu_park_loop();
>
> Likewise here.

OK.

Thanks
Suzuki
Catalin Marinas Feb. 3, 2016, 5:34 p.m. UTC | #7
On Wed, Feb 03, 2016 at 04:46:32PM +0000, Mark Rutland wrote:
> On Wed, Feb 03, 2016 at 12:57:38PM +0000, Catalin Marinas wrote:
> > On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K. Poulose wrote:
> > > + * update_early_cpu_boot_status tmp, status
> > > + *  - Corrupts tmp, x0, x1
> > > + *  - Writes 'status' to __early_cpu_boot_status and makes sure
> > > + *    it is committed to memory.
> > > + */
> > > +
> > > +	.macro	update_early_cpu_boot_status tmp, status
> > > +	mov	\tmp, lr
> > > +	adrp	x0, __early_cpu_boot_status
> > > +	add	x0, x0, #:lo12:__early_cpu_boot_status
> > 
> > Nitpick: you could use the adr_l macro.
> > 
> > > +	mov	x1, #\status
> > > +	str	x1, [x0]
> > > +	add	x1, x0, 4
> > > +	bl	__inval_cache_range
> > > +	mov	lr, \tmp
> > > +	.endm
> > 
> > If the CPU that's currently booting has the MMU off, what's the point of
> > invalidating the cache here?
> 
> To invalidate stale lines for this address, held in any caches prior to
> the PoC. I'm assuming that __early_cpu_boot_status is sufficiently
> padded to the CWG.

I would have rather invalidated it before writing the [x0], if that's
what it's aimed at.

> Cache maintenance works when SCTLR_ELx.M == 0, though barriers are
> required prior to cache maintenance as non-cacheable accesses do not
> hazard by VA.
> 
> The MMU being off has no effect on the cache maintenance itself.

I know, but whether it has an effect on other CPUs is a different
question (it probably has). Anyway, I would rather do the invalidation
on the CPU that actually reads this status.

> > The operation may not even be broadcast to the other CPU. So you
> > actually need the invalidation before reading the status on the
> > primary CPU.
> 
> We require that CPUs are coherent when they enter the kernel, so any
> cache maintenance operation _must_ affect all coherent caches (i.e. it
> must be broadcast and must affect all coherent caches prior to the PoC
> in this case).

In general, if you perform cache maintenance on a non-shareable mapping,
I don't think it would be broadcast. But in this case, the MMU is off,
data accesses default to Device_nGnRnE and are considered outer
shareable, so it may actually work. Is this stated anywhere in the ARM
ARM?

But see my point above about invalidating on the secondary CPU before
writing the location and invalidating again on the primary CPU before
reading it.
Mark Rutland Feb. 3, 2016, 5:35 p.m. UTC | #8
On Wed, Feb 03, 2016 at 05:24:07PM +0000, Suzuki K. Poulose wrote:
> On 03/02/16 17:01, Mark Rutland wrote:
> >On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K Poulose wrote:
> >>From: Suzuki K. Poulose <suzuki.poulose@arm.com>

[...]

> >>  struct secondary_data secondary_data;
> >>+/* Number of CPUs which aren't online, but looping in kernel text. */
> >>+u32 cpus_stuck_in_kernel;
> >
> >Why u32 rather than int?
> 
> No specific reasons, since it is going to be a quantity, which cannot be < 0,
> kept it unsigned. It could be unsigned int.

Elsewhere, int or unsigned int is used to contain a cpu number. I think
either would be preferable to u32, to at least limit the inconsistency.

> >>+#ifdef CONFIG_HOTPLUG_CPU
> >>+static int op_cpu_kill(unsigned int cpu);
> >>+#else
> >>+static inline int op_cpu_kill(unsigned int cpu)
> >>+{
> >>+	return -ENOSYS;
> >>+}
> >>+#endif
> >
> >There is no !CONFIG_HOTPLUG_CPU configuration any more.
> 
> Thats what I thought but then there was [1]. If you disable CONFIG_PM_SLEEP, you can
> still build with !CONFIG_HOTPLUG_CPU (or in other words allnoconfig)
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/384589.html

Aargh, indeed. I had confused CONFIG_HOTPLUG_CPU and CONFIG_SMP. Sorry!

Mark.
Mark Rutland Feb. 3, 2016, 5:53 p.m. UTC | #9
On Wed, Feb 03, 2016 at 05:34:49PM +0000, Catalin Marinas wrote:
> On Wed, Feb 03, 2016 at 04:46:32PM +0000, Mark Rutland wrote:
> > On Wed, Feb 03, 2016 at 12:57:38PM +0000, Catalin Marinas wrote:
> > > On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K. Poulose wrote:
> > > > + * update_early_cpu_boot_status tmp, status
> > > > + *  - Corrupts tmp, x0, x1
> > > > + *  - Writes 'status' to __early_cpu_boot_status and makes sure
> > > > + *    it is committed to memory.
> > > > + */
> > > > +
> > > > +	.macro	update_early_cpu_boot_status tmp, status
> > > > +	mov	\tmp, lr
> > > > +	adrp	x0, __early_cpu_boot_status
> > > > +	add	x0, x0, #:lo12:__early_cpu_boot_status
> > > 
> > > Nitpick: you could use the adr_l macro.
> > > 
> > > > +	mov	x1, #\status
> > > > +	str	x1, [x0]
> > > > +	add	x1, x0, 4
> > > > +	bl	__inval_cache_range
> > > > +	mov	lr, \tmp
> > > > +	.endm
> > > 
> > > If the CPU that's currently booting has the MMU off, what's the point of
> > > invalidating the cache here?
> > 
> > To invalidate stale lines for this address, held in any caches prior to
> > the PoC. I'm assuming that __early_cpu_boot_status is sufficiently
> > padded to the CWG.
> 
> I would have rather invalidated it before writing the [x0], if that's
> what it's aimed at.

That alone wouldn't be sufficient, due to speculative fetches
allocating new (clean) lines prior to the write completing.

I was expecting the CWG-aligned region to only be written to with the
MMU off, i.e. we'd only have clean stale lines and no dirty lines.

> > Cache maintenance works when SCTLR_ELx.M == 0, though barriers are
> > required prior to cache maintenance as non-cacheable accesses do not
> > hazard by VA.
> > 
> > The MMU being off has no effect on the cache maintenance itself.
> 
> I know, but whether it has an effect on other CPUs is a different
> question (it probably has). Anyway, I would rather do the invalidation
> on the CPU that actually reads this status.

My only worry would be how this gets ordered against the (non-cacheable)
store. I guess we'd complete that with a DSB SY regardless.

Given that, I have no problem doing the invalidate on the read side.
Assuming we only write from the side with the MMU off, we don't need
maintenance on that side.

> > > The operation may not even be broadcast to the other CPU. So you
> > > actually need the invalidation before reading the status on the
> > > primary CPU.
> > 
> > We require that CPUs are coherent when they enter the kernel, so any
> > cache maintenance operation _must_ affect all coherent caches (i.e. it
> > must be broadcast and must affect all coherent caches prior to the PoC
> > in this case).
> 
> In general, if you perform cache maintenance on a non-shareable mapping,
> I don't think it would be broadcast. But in this case, the MMU is off,
> data accesses default to Device_nGnRnE and considered outer shareable,
> so it may actually work. Is this stated anywhere in the ARM ARM?

In ARM DDI 0487A.h, D4.2.8 "The effects of disabling a stage of address
translation" we state:

	Cache maintenance instructions act on the target cache
	regardless of whether any stages of address translation are
	disabled, and regardless of the values of the memory attributes.
	However, if a stage of address translation is disabled, they use
	the flat address mapping for that translation stage.

Thanks,
Mark.
Catalin Marinas Feb. 3, 2016, 6:12 p.m. UTC | #10
On Wed, Feb 03, 2016 at 05:53:51PM +0000, Mark Rutland wrote:
> On Wed, Feb 03, 2016 at 05:34:49PM +0000, Catalin Marinas wrote:
> > On Wed, Feb 03, 2016 at 04:46:32PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 03, 2016 at 12:57:38PM +0000, Catalin Marinas wrote:
> > > > On Mon, Jan 25, 2016 at 06:07:02PM +0000, Suzuki K. Poulose wrote:
> > > > > + * update_early_cpu_boot_status tmp, status
> > > > > + *  - Corrupts tmp, x0, x1
> > > > > + *  - Writes 'status' to __early_cpu_boot_status and makes sure
> > > > > + *    it is committed to memory.
> > > > > + */
> > > > > +
> > > > > +	.macro	update_early_cpu_boot_status tmp, status
> > > > > +	mov	\tmp, lr
> > > > > +	adrp	x0, __early_cpu_boot_status
> > > > > +	add	x0, x0, #:lo12:__early_cpu_boot_status
> > > > 
> > > > Nitpick: you could use the adr_l macro.
> > > > 
> > > > > +	mov	x1, #\status
> > > > > +	str	x1, [x0]
> > > > > +	add	x1, x0, 4
> > > > > +	bl	__inval_cache_range
> > > > > +	mov	lr, \tmp
> > > > > +	.endm
> > > > 
> > > > If the CPU that's currently booting has the MMU off, what's the point of
> > > > invalidating the cache here?
> > > 
> > > To invalidate stale lines for this address, held in any caches prior to
> > > the PoC. I'm assuming that __early_cpu_boot_status is sufficiently
> > > padded to the CWG.
> > 
> > I would have rather invalidated it before writing the [x0], if that's
> > what it's aimed at.
> 
> That alone wouldn't not be sufficient, due to speculative fetches
> allocating new (clean) lines prior to the write completing.

Indeed.

> I was expecting the CWG-aligned region to only be written to with the
> MMU off, i.e. we'd only have clean stale lines and no dirty lines.

I assume that's true even in a guest.

> > > Cache maintenance works when SCTLR_ELx.M == 0, though barriers are
> > > required prior to cache maintenance as non-cacheable accesses do not
> > > hazard by VA.
> > > 
> > > The MMU being off has no effect on the cache maintenance itself.
> > 
> > I know, but whether it has an effect on other CPUs is a different
> > question (it probably has). Anyway, I would rather do the invalidation
> > on the CPU that actually reads this status.
> 
> My only worry would be how this gets ordered against the (non-cacheable)
> store. I guess we'd complete that with a DSB SY regardless.

The synchronisation on the primary CPU is a bit flaky anyway and just
relies on long timeouts: it waits a while (1 sec) for the boot to
complete and then reads the status. So it assumes that the secondary
CPU has finished its write, but neither DSB nor cache maintenance
guarantees this.

> Given that, I have no problem doing the invalidate on the read side.
> Assuming we only write from the side with the MMU off, we don't need
> maintenance on that side.

I agree.

> > > > The operation may not even be broadcast to the other CPU. So you
> > > > actually need the invalidation before reading the status on the
> > > > primary CPU.
> > > 
> > > We require that CPUs are coherent when they enter the kernel, so any
> > > cache maintenance operation _must_ affect all coherent caches (i.e. it
> > > must be broadcast and must affect all coherent caches prior to the PoC
> > > in this case).
> > 
> > In general, if you perform cache maintenance on a non-shareable mapping,
> > I don't think it would be broadcast. But in this case, the MMU is off,
> > data accesses default to Device_nGnRnE and are considered outer shareable,
> > so it may actually work. Is this stated anywhere in the ARM ARM?
> 
> In ARM DDI 0487A.h, D4.2.8 "The effects of disabling a stage of address
> translation" we state:
> 
> 	Cache maintenance instructions act on the target cache
> 	regardless of whether any stages of address translation are
> 	disabled, and regardless of the values of the memory attributes.
> 	However, if a stage of address translation is disabled, they use
> 	the flat address mapping for that translation stage.

But does "target cache" include other CPUs in the system?
Mark Rutland Feb. 3, 2016, 7:31 p.m. UTC | #11
On Wed, Feb 03, 2016 at 06:12:52PM +0000, Catalin Marinas wrote:
> On Wed, Feb 03, 2016 at 05:53:51PM +0000, Mark Rutland wrote:
> > On Wed, Feb 03, 2016 at 05:34:49PM +0000, Catalin Marinas wrote:
> > > In general, if you perform cache maintenance on a non-shareable mapping,
> > > I don't think it would be broadcast. But in this case, the MMU is off,
> > > data accesses default to Device_nGnRnE and are considered outer shareable,
> > > so it may actually work. Is this stated anywhere in the ARM ARM?
> > 
> > In ARM DDI 0487A.h, D4.2.8 "The effects of disabling a stage of address
> > translation" we state:
> > 
> > 	Cache maintenance instructions act on the target cache
> > 	regardless of whether any stages of address translation are
> > 	disabled, and regardless of the values of the memory attributes.
> > 	However, if a stage of address translation is disabled, they use
> > 	the flat address mapping for that translation stage.
> 
> But does "target cache" include other CPUs in the system?

It appears so. There's a more complete description (with a table) on
page D3-1718, "Effects of instructions that operate by VA to the Point
of Coherency", where it's explicitly stated:

	For Device memory and Normal memory that is Inner Non-cacheable,
	Outer Non-cacheable, these instructions must affect the caches
	of all PEs in the Outer Shareable shareability domain of the PE
	on which the instruction is operating.

In other cases, per the table, the maintenance may be limited to a
smaller set of caches.

Thanks,
Mark.

Patch

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 32e75ee..6548769 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -16,6 +16,19 @@ 
 #ifndef __ASM_SMP_H
 #define __ASM_SMP_H
 
+/* Values for secondary_data.status */
+
+#define CPU_MMU_OFF		-1
+#define CPU_BOOT_SUCCESS	0
+/* The cpu invoked ops->cpu_die, synchronise it with cpu_kill */
+#define CPU_KILL_ME		1
+/* The cpu couldn't die gracefully and is looping in the kernel */
+#define CPU_STUCK_IN_KERNEL	2
+/* Fatal system error detected by secondary CPU, crash the system */
+#define CPU_PANIC_KERNEL	3
+
+#ifndef __ASSEMBLY__
+
 #include <linux/threads.h>
 #include <linux/cpumask.h>
 #include <linux/thread_info.h>
@@ -54,11 +67,17 @@  asmlinkage void secondary_start_kernel(void);
 
 /*
  * Initial data for bringing up a secondary CPU.
+ * @stack  - sp for the secondary CPU
+ * @status - Result passed back from the secondary CPU to
+ *           indicate failure.
  */
 struct secondary_data {
 	void *stack;
+	int status;
 };
+
 extern struct secondary_data secondary_data;
+extern int __early_cpu_boot_status;
 extern void secondary_entry(void);
 
 extern void arch_send_call_function_single_ipi(int cpu);
@@ -78,4 +97,11 @@  static inline void cpu_park_loop(void)
 	}
 }
 
+static inline void update_cpu_boot_status(int val)
+{
+	WRITE_ONCE(secondary_data.status, val);
+}
+
+#endif /* ifndef __ASSEMBLY__ */
+
 #endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index fffa4ac6..463fcf1 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -104,6 +104,8 @@  int main(void)
   DEFINE(TZ_MINWEST,		offsetof(struct timezone, tz_minuteswest));
   DEFINE(TZ_DSTTIME,		offsetof(struct timezone, tz_dsttime));
   BLANK();
+  DEFINE(CPU_BOOT_STACK,	offsetof(struct secondary_data, stack));
+  BLANK();
 #ifdef CONFIG_KVM_ARM_HOST
   DEFINE(VCPU_CONTEXT,		offsetof(struct kvm_vcpu, arch.ctxt));
   DEFINE(CPU_GP_REGS,		offsetof(struct kvm_cpu_context, gp_regs));
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ffe9c2b..16bfa34 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -34,6 +34,7 @@ 
 #include <asm/pgtable-hwdef.h>
 #include <asm/pgtable.h>
 #include <asm/page.h>
+#include <asm/smp.h>
 #include <asm/sysreg.h>
 #include <asm/thread_info.h>
 #include <asm/virt.h>
@@ -606,7 +607,7 @@  ENTRY(secondary_startup)
 ENDPROC(secondary_startup)
 
 ENTRY(__secondary_switched)
-	ldr	x0, [x21]			// get secondary_data.stack
+	ldr	x0, [x21, #CPU_BOOT_STACK]	// get secondary_data.stack
 	mov	sp, x0
 	and	x0, x0, #~(THREAD_SIZE - 1)
 	msr	sp_el0, x0			// save thread_info
@@ -615,6 +616,32 @@  ENTRY(__secondary_switched)
 ENDPROC(__secondary_switched)
 
 /*
+ * If the booting CPU fails with the MMU off, it updates the status below,
+ * which lies in head.text and doesn't share a cache writeback granule
+ * with anything else, so that it can be invalidated safely.
+ *
+ * update_early_cpu_boot_status tmp, status
+ *  - Corrupts tmp, x0, x1
+ *  - Writes 'status' to __early_cpu_boot_status and makes sure
+ *    it is committed to memory.
+ */
+
+	.macro	update_early_cpu_boot_status tmp, status
+	mov	\tmp, lr
+	adrp	x0, __early_cpu_boot_status
+	add	x0, x0, #:lo12:__early_cpu_boot_status
+	mov	x1, #\status
+	str	x1, [x0]
+	add	x1, x0, #4
+	bl	__inval_cache_range
+	mov	lr, \tmp
+	.endm
+
+ENTRY(__early_cpu_boot_status)
+	.long 	0
+END(__early_cpu_boot_status)
+
+/*
  * Enable the MMU.
  *
  *  x0  = SCTLR_EL1 value for turning on the MMU.
@@ -631,12 +658,14 @@  __enable_mmu:
 	ubfx	x2, x1, #ID_AA64MMFR0_TGRAN_SHIFT, 4
 	cmp	x2, #ID_AA64MMFR0_TGRAN_SUPPORTED
 	b.ne	__no_granule_support
+	mov	x4, x0
+	update_early_cpu_boot_status x2, 0
 	ldr	x5, =vectors
 	msr	vbar_el1, x5
 	msr	ttbr0_el1, x25			// load TTBR0
 	msr	ttbr1_el1, x26			// load TTBR1
 	isb
-	msr	sctlr_el1, x0
+	msr	sctlr_el1, x4
 	isb
 	/*
 	 * Invalidate the local I-cache so that any instructions fetched
@@ -650,6 +679,10 @@  __enable_mmu:
 ENDPROC(__enable_mmu)
 
 __no_granule_support:
+	/* Indicate that this CPU can't boot and is stuck in the kernel */
+	update_early_cpu_boot_status x2, CPU_STUCK_IN_KERNEL
+1:
 	wfe
-	b __no_granule_support
+	wfi
+	b 1b
 ENDPROC(__no_granule_support)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 59f032b..d2721ae 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -63,6 +63,8 @@ 
  * where to place its SVC stack
  */
 struct secondary_data secondary_data;
+/* Number of CPUs which aren't online, but looping in kernel text. */
+u32 cpus_stuck_in_kernel;
 
 enum ipi_msg_type {
 	IPI_RESCHEDULE,
@@ -72,6 +74,16 @@  enum ipi_msg_type {
 	IPI_IRQ_WORK,
 };
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int op_cpu_kill(unsigned int cpu);
+#else
+static inline int op_cpu_kill(unsigned int cpu)
+{
+	return -ENOSYS;
+}
+#endif
+
+
 /*
  * Boot a secondary CPU, and assign it the specified idle task.
  * This also gives us the initial stack to use for this CPU.
@@ -89,12 +101,14 @@  static DECLARE_COMPLETION(cpu_running);
 int __cpu_up(unsigned int cpu, struct task_struct *idle)
 {
 	int ret;
+	int status;
 
 	/*
 	 * We need to tell the secondary core where to find its stack and the
 	 * page tables.
 	 */
 	secondary_data.stack = task_stack_page(idle) + THREAD_START_SP;
+	update_cpu_boot_status(CPU_MMU_OFF);
 	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
 
 	/*
@@ -117,7 +131,35 @@  int __cpu_up(unsigned int cpu, struct task_struct *idle)
 		pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
 	}
 
+	/* Make sure the update to status is visible */
+	smp_rmb();
 	secondary_data.stack = NULL;
+	status = READ_ONCE(secondary_data.status);
+	if (ret && status) {
+
+		if (status == CPU_MMU_OFF)
+			status = READ_ONCE(__early_cpu_boot_status);
+
+		switch (status) {
+		default:
+			pr_err("CPU%u: failed in unknown state: 0x%x\n",
+					cpu, status);
+			break;
+		case CPU_KILL_ME:
+			if (!op_cpu_kill(cpu)) {
+				pr_crit("CPU%u: died during early boot\n", cpu);
+				break;
+			}
+			pr_crit("CPU%u: may not have shut down cleanly\n", cpu);
+			/* Fall through */
+		case CPU_STUCK_IN_KERNEL:
+			pr_crit("CPU%u: is stuck in kernel\n", cpu);
+			cpus_stuck_in_kernel++;
+			break;
+		case CPU_PANIC_KERNEL:
+			panic("CPU%u detected unsupported configuration\n", cpu);
+		}
+	}
 
 	return ret;
 }
@@ -185,6 +227,9 @@  asmlinkage void secondary_start_kernel(void)
 	 */
 	pr_info("CPU%u: Booted secondary processor [%08x]\n",
 					 cpu, read_cpuid_id());
+	update_cpu_boot_status(CPU_BOOT_SUCCESS);
+	/* Make sure the status update is visible before we complete */
+	smp_wmb();
 	set_cpu_online(cpu, true);
 	complete(&cpu_running);
 
@@ -327,10 +372,12 @@  void cpu_die_early(void)
 	set_cpu_present(cpu, 0);
 
 #ifdef CONFIG_HOTPLUG_CPU
+	update_cpu_boot_status(CPU_KILL_ME);
 	/* Check if we can park ourselves */
 	if (cpu_ops[cpu] && cpu_ops[cpu]->cpu_die)
 		cpu_ops[cpu]->cpu_die(cpu);
 #endif
+	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
 
 	cpu_park_loop();
 }