diff mbox

[1/2] arm64: smp: Add function to determine if cpus are stuck in the kernel

Message ID 1466156097-20028-2-git-send-email-james.morse@arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

James Morse June 17, 2016, 9:34 a.m. UTC
kernel/smp.c has a fancy counter that keeps track of the number of CPUs
it marked as not-present and left in cpu_park_loop(). If there are any
CPUs spinning in here, features like kexec or hibernate may release them
by overwriting this memory.

This problem also occurs on machines using spin-tables to release
secondary cores.
After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N")
we bring all known cpus into the secondary holding pen, but may not bring
them up depending on 'maxcpus'. This memory can't be re-used by kexec
or hibernate.

Add a function cpus_are_stuck_in_kernel() to determine if either of these
cases have occurred.

Signed-off-by: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++
 arch/arm64/kernel/smp.c      | 13 +++++++++++++
 2 files changed, 33 insertions(+)

Comments

Mark Rutland June 17, 2016, 10:27 a.m. UTC | #1
On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote:
> kernel/smp.c has a fancy counter that keeps track of the number of CPUs
> it marked as not-present and left in cpu_park_loop(). If there are any
> CPUs spinning in here, features like kexec or hibernate may release them
> by overwriting this memory.
> 
> This problem also occurs on machines using spin-tables to release
> secondary cores.
> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N")
> we bring all known cpus into the secondary holding pen, but may not bring
> them up depending on 'maxcpus'. This memory can't be re-used by kexec
> or hibernate.
> 
> Add a function cpus_are_stuck_in_kernel() to determine if either of these
> cases have occurred.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++
>  arch/arm64/kernel/smp.c      | 13 +++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> index 433e50405274..4be755bcc07a 100644
> --- a/arch/arm64/include/asm/smp.h
> +++ b/arch/arm64/include/asm/smp.h
> @@ -124,6 +124,26 @@ static inline void cpu_panic_kernel(void)
>  	cpu_park_loop();
>  }
>  
> +/*
> + * Kernel features such as hibernate and kexec depend on cpu hotplug to know
> + * they can replace any kernel memory they are not using themselves.
> + *
> + * There are two corner cases:
> + * If a secondary CPU fails to come online, (e.g. due to mismatched features),
> + * it will try to call cpu_die(). If this fails, it increases the counter
> + * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing
> + * this function must not be re-used for anything else as the 'stuck' core
> + * is executing it.

It might also be stuck in __no_granule_support, if it never made it to C
code. In that case, the CPU in charge of bringing up that new CPU will
increment the counter in __cpu_up.

There might be other reasons we do something like that in future, so it
might be better to be a little less specific and say something like:

	If a secondary CPU enters the kernel but fails to come online,
	(e.g. due to mismatched features), and cannot exit the kernel,
	we increment cpus_stuck_in_kernel and leave the CPU in a
	quiesecent loop within the kernel text. The memory containing
	this loop must not be re-used for anything else as the 'stuck'
	core is executing it.

Otherwise, this looks good. FWIW, either way:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> + *
> + * CPUs are also considered stuck in the kernel if we have multiple CPUs
> + * and no way to offline secondary CPUs. This happens when secondaries
> + * are released via spin-table, these CPUs are moved into the kernel's
> + * secondary_holding_pen, which must not be overwritten.
> + *
> + * This function is used to inhibit features like kexec and hibernate.
> + */
> +bool cpus_are_stuck_in_kernel(void);
> +
>  #endif /* ifndef __ASSEMBLY__ */
>  
>  #endif /* ifndef __ASM_SMP_H */
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 678e0842cb3b..e197502f94fd 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier)
>  {
>  	return -EINVAL;
>  }
> +
> +bool cpus_are_stuck_in_kernel(void)
> +{
> +	bool ret = !!cpus_stuck_in_kernel;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int any_cpu = raw_smp_processor_id();
> +
> +	if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
> +		ret = true;
> +#endif
> +
> +	return ret;
> +}
> -- 
> 2.8.0.rc3
>
Suzuki K Poulose June 17, 2016, 11:16 a.m. UTC | #2
On 17/06/16 11:27, Mark Rutland wrote:
> On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote:
>> kernel/smp.c has a fancy counter that keeps track of the number of CPUs
>> it marked as not-present and left in cpu_park_loop(). If there are any
>> CPUs spinning in here, features like kexec or hibernate may release them
>> by overwriting this memory.
>>
>> This problem also occurs on machines using spin-tables to release
>> secondary cores.
>> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N")
>> we bring all known cpus into the secondary holding pen, but may not bring
>> them up depending on 'maxcpus'. This memory can't be re-used by kexec
>> or hibernate.
>>
>> Add a function cpus_are_stuck_in_kernel() to determine if either of these
>> cases have occurred.
>>
>> Signed-off-by: James Morse <james.morse@arm.com>
>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++
>>   arch/arm64/kernel/smp.c      | 13 +++++++++++++
>>   2 files changed, 33 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
>> index 433e50405274..4be755bcc07a 100644
>> --- a/arch/arm64/include/asm/smp.h
>> +++ b/arch/arm64/include/asm/smp.h
>> @@ -124,6 +124,26 @@ static inline void cpu_panic_kernel(void)
>>   	cpu_park_loop();
>>   }
>>
>> +/*
>> + * Kernel features such as hibernate and kexec depend on cpu hotplug to know
>> + * they can replace any kernel memory they are not using themselves.
>> + *
>> + * There are two corner cases:
>> + * If a secondary CPU fails to come online, (e.g. due to mismatched features),
>> + * it will try to call cpu_die(). If this fails, it increases the counter
>> + * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing
>> + * this function must not be re-used for anything else as the 'stuck' core
>> + * is executing it.
>
> It might also be stuck in __no_granule_support, if it never made it to C
> code. In that case, the CPU in charge of bringing up that new CPU will
> increment the counter in __cpu_up.

Just to clarify, *in all the cases*, the CPU in charge of bringing up updates
the cpus_stuck_in_kernel.

>
> There might be other reasons we do something like that in future, so it
> might be better to be a little less specific and say something like:
>
> 	If a secondary CPU enters the kernel but fails to come online,
> 	(e.g. due to mismatched features), and cannot exit the kernel,
> 	we increment cpus_stuck_in_kernel and leave the CPU in a
> 	quiesecent loop within the kernel text. The memory containing
> 	this loop must not be re-used for anything else as the 'stuck'
> 	core is executing it.

Agree.

>> +bool cpus_are_stuck_in_kernel(void);
>> +
>>   #endif /* ifndef __ASSEMBLY__ */
>>
>>   #endif /* ifndef __ASM_SMP_H */
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index 678e0842cb3b..e197502f94fd 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier)
>>   {
>>   	return -EINVAL;
>>   }
>> +
>> +bool cpus_are_stuck_in_kernel(void)
>> +{
>> +	bool ret = !!cpus_stuck_in_kernel;
>> +#ifdef CONFIG_HOTPLUG_CPU
>> +	int any_cpu = raw_smp_processor_id();
>> +
>> +	if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
>> +		ret = true;
>> +#endif

Minor nit: Moving the cpu_die check to a static inline function with
an obvious name might make the code look better.

	return !!cpus_stuck_in_kernel || !have_cpu_die() ?

Eitherway,

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Cheers
Suzuki

>> +
>> +	return ret;
>> +}
>> --
>> 2.8.0.rc3
>>
>
Geoff Levand June 17, 2016, 4:37 p.m. UTC | #3
On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote:
> +bool cpus_are_stuck_in_kernel(void)
> +{
> +> 	> bool ret = !!cpus_stuck_in_kernel;
> +#ifdef CONFIG_HOTPLUG_CPU

How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here?

> +	int any_cpu = raw_smp_processor_id();
> +
> +> 	> if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
> +> 	> 	> ret = true;
> +#endif

-Geoff
Suzuki K Poulose June 17, 2016, 4:39 p.m. UTC | #4
On 17/06/16 17:37, Geoff Levand wrote:
> On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote:
>> +bool cpus_are_stuck_in_kernel(void)
>> +{
>> +> 	> bool ret = !!cpus_stuck_in_kernel;
>> +#ifdef CONFIG_HOTPLUG_CPU
>
> How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here?
>
>> +	int any_cpu = raw_smp_processor_id();
>> +
>> +> 	> if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
>> +> 	> 	> ret = true;
>> +#endif
>

That won't work, as the cpu_ops->cpu_die is defined only if CONFIG_HOTPLUG_CPU.

May be we should fix that.

Suzuki
Mark Rutland June 17, 2016, 4:42 p.m. UTC | #5
On Fri, Jun 17, 2016 at 09:37:12AM -0700, Geoff Levand wrote:
> On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote:
> > +bool cpus_are_stuck_in_kernel(void)
> > +{
> > +> 	> bool ret = !!cpus_stuck_in_kernel;
> > +#ifdef CONFIG_HOTPLUG_CPU
> 
> How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here?

I wanted to suggest that too, but ...

> > +	int any_cpu = raw_smp_processor_id();
> > +
> > +> 	> if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
> > +> 	> 	> ret = true;

... here we'd blow up as cpu_ops::cpu_die is only defined when
CONFIG_HOTPLUG_CPU is selected.

So we'll have to use an ifdef, but we could instead ifdef a whole stub
function (e.g. have_cpu_die()), as Suzuki suggested.

Thanks,
Mark.
James Morse June 17, 2016, 4:48 p.m. UTC | #6
Hi Suzuki,

On 17/06/16 12:16, Suzuki K Poulose wrote:
> On 17/06/16 11:27, Mark Rutland wrote:
>> On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote:
>>> kernel/smp.c has a fancy counter that keeps track of the number of CPUs
>>> it marked as not-present and left in cpu_park_loop(). If there are any
>>> CPUs spinning in here, features like kexec or hibernate may release them
>>> by overwriting this memory.
>>>
>>> This problem also occurs on machines using spin-tables to release
>>> secondary cores.
>>> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N")
>>> we bring all known cpus into the secondary holding pen, but may not bring
>>> them up depending on 'maxcpus'. This memory can't be re-used by kexec
>>> or hibernate.
>>>
>>> Add a function cpus_are_stuck_in_kernel() to determine if either of these
>>> cases have occurred.

>> It might also be stuck in __no_granule_support, if it never made it to C
>> code. In that case, the CPU in charge of bringing up that new CPU will
>> increment the counter in __cpu_up.
> 
> Just to clarify, *in all the cases*, the CPU in charge of bringing up updates
> the cpus_stuck_in_kernel.

Ah, my mistake. I will switch it for Mark's suggestion.


>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index 678e0842cb3b..e197502f94fd 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier)
>>>   {
>>>       return -EINVAL;
>>>   }
>>> +
>>> +bool cpus_are_stuck_in_kernel(void)
>>> +{
>>> +    bool ret = !!cpus_stuck_in_kernel;
>>> +#ifdef CONFIG_HOTPLUG_CPU
>>> +    int any_cpu = raw_smp_processor_id();
>>> +
>>> +    if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
>>> +        ret = true;
>>> +#endif
> 
> Minor nit: Moving the cpu_die check to a static inline function with
> an obvious name might make the code look better.
> 
>     return !!cpus_stuck_in_kernel || !have_cpu_die() ?
> 

That would be better!


> Eitherway,
> 
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Thanks,

James
Will Deacon June 21, 2016, 6:07 p.m. UTC | #7
HI James,

On Fri, Jun 17, 2016 at 05:48:35PM +0100, James Morse wrote:
> On 17/06/16 12:16, Suzuki K Poulose wrote:
> > On 17/06/16 11:27, Mark Rutland wrote:
> >> On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote:
> >>> kernel/smp.c has a fancy counter that keeps track of the number of CPUs
> >>> it marked as not-present and left in cpu_park_loop(). If there are any
> >>> CPUs spinning in here, features like kexec or hibernate may release them
> >>> by overwriting this memory.
> >>>
> >>> This problem also occurs on machines using spin-tables to release
> >>> secondary cores.
> >>> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N")
> >>> we bring all known cpus into the secondary holding pen, but may not bring
> >>> them up depending on 'maxcpus'. This memory can't be re-used by kexec
> >>> or hibernate.
> >>>
> >>> Add a function cpus_are_stuck_in_kernel() to determine if either of these
> >>> cases have occurred.
> 
> >> It might also be stuck in __no_granule_support, if it never made it to C
> >> code. In that case, the CPU in charge of bringing up that new CPU will
> >> increment the counter in __cpu_up.
> > 
> > Just to clarify, *in all the cases*, the CPU in charge of bringing up updates
> > the cpus_stuck_in_kernel.
> 
> Ah, my mistake. I will switch it for Mark's suggestion.
> 
> >>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >>> index 678e0842cb3b..e197502f94fd 100644
> >>> --- a/arch/arm64/kernel/smp.c
> >>> +++ b/arch/arm64/kernel/smp.c
> >>> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier)
> >>>   {
> >>>       return -EINVAL;
> >>>   }
> >>> +
> >>> +bool cpus_are_stuck_in_kernel(void)
> >>> +{
> >>> +    bool ret = !!cpus_stuck_in_kernel;
> >>> +#ifdef CONFIG_HOTPLUG_CPU
> >>> +    int any_cpu = raw_smp_processor_id();
> >>> +
> >>> +    if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
> >>> +        ret = true;
> >>> +#endif
> > 
> > Minor nit: Moving the cpu_die check to a static inline function with
> > an obvious name might make the code look better.
> > 
> >     return !!cpus_stuck_in_kernel || !have_cpu_die() ?
> > 
> 
> That would be better!

Can you post a new version of this, please?

Will
diff mbox

Patch

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 433e50405274..4be755bcc07a 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -124,6 +124,26 @@  static inline void cpu_panic_kernel(void)
 	cpu_park_loop();
 }
 
+/*
+ * Kernel features such as hibernate and kexec depend on cpu hotplug to know
+ * they can replace any kernel memory they are not using themselves.
+ *
+ * There are two corner cases:
+ * If a secondary CPU fails to come online, (e.g. due to mismatched features),
+ * it will try to call cpu_die(). If this fails, it increases the counter
+ * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing
+ * this function must not be re-used for anything else as the 'stuck' core
+ * is executing it.
+ *
+ * CPUs are also considered stuck in the kernel if we have multiple CPUs
+ * and no way to offline secondary CPUs. This happens when secondaries
+ * are released via spin-table, these CPUs are moved into the kernel's
+ * secondary_holding_pen, which must not be overwritten.
+ *
+ * This function is used to inhibit features like kexec and hibernate.
+ */
+bool cpus_are_stuck_in_kernel(void);
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 678e0842cb3b..e197502f94fd 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -909,3 +909,16 @@  int setup_profiling_timer(unsigned int multiplier)
 {
 	return -EINVAL;
 }
+
+bool cpus_are_stuck_in_kernel(void)
+{
+	bool ret = !!cpus_stuck_in_kernel;
+#ifdef CONFIG_HOTPLUG_CPU
+	int any_cpu = raw_smp_processor_id();
+
+	if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die)
+		ret = true;
+#endif
+
+	return ret;
+}