Message ID | 1466156097-20028-2-git-send-email-james.morse@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote: > kernel/smp.c has a fancy counter that keeps track of the number of CPUs > it marked as not-present and left in cpu_park_loop(). If there are any > CPUs spinning in here, features like kexec or hibernate may release them > by overwriting this memory. > > This problem also occurs on machines using spin-tables to release > secondary cores. > After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N") > we bring all known cpus into the secondary holding pen, but may not bring > them up depending on 'maxcpus'. This memory can't be re-used by kexec > or hibernate. > > Add a function cpus_are_stuck_in_kernel() to determine if either of these > cases have occurred. > > Signed-off-by: James Morse <james.morse@arm.com> > Cc: Suzuki K Poulose <suzuki.poulose@arm.com> > --- > arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++ > arch/arm64/kernel/smp.c | 13 +++++++++++++ > 2 files changed, 33 insertions(+) > > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h > index 433e50405274..4be755bcc07a 100644 > --- a/arch/arm64/include/asm/smp.h > +++ b/arch/arm64/include/asm/smp.h > @@ -124,6 +124,26 @@ static inline void cpu_panic_kernel(void) > cpu_park_loop(); > } > > +/* > + * Kernel features such as hibernate and kexec depend on cpu hotplug to know > + * they can replace any kernel memory they are not using themselves. > + * > + * There are two corner cases: > + * If a secondary CPU fails to come online, (e.g. due to mismatched features), > + * it will try to call cpu_die(). If this fails, it increases the counter > + * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing > + * this function must not be re-used for anything else as the 'stuck' core > + * is executing it. It might also be stuck in __no_granule_support, if it never made it to C code. In that case, the CPU in charge of bringing up that new CPU will increment the counter in __cpu_up. There might be other reasons we do something like that in future, so it might be better to be a little less specific and say something like: If a secondary CPU enters the kernel but fails to come online, (e.g. due to mismatched features), and cannot exit the kernel, we increment cpus_stuck_in_kernel and leave the CPU in a quiesecent loop within the kernel text. The memory containing this loop must not be re-used for anything else as the 'stuck' core is executing it. Otherwise, this looks good. FWIW, either way: Acked-by: Mark Rutland <mark.rutland@arm.com> Thanks, Mark. > + * > + * CPUs are also considered stuck in the kernel if we have multiple CPUs > + * and no way to offline secondary CPUs. This happens when secondaries > + * are released via spin-table, these CPUs are moved into the kernel's > + * secondary_holding_pen, which must not be overwritten. > + * > + * This function is used to inhibit features like kexec and hibernate. > + */ > +bool cpus_are_stuck_in_kernel(void); > + > #endif /* ifndef __ASSEMBLY__ */ > > #endif /* ifndef __ASM_SMP_H */ > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 678e0842cb3b..e197502f94fd 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier) > { > return -EINVAL; > } > + > +bool cpus_are_stuck_in_kernel(void) > +{ > + bool ret = !!cpus_stuck_in_kernel; > +#ifdef CONFIG_HOTPLUG_CPU > + int any_cpu = raw_smp_processor_id(); > + > + if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) > + ret = true; > +#endif > + > + return ret; > +} > -- > 2.8.0.rc3 >
On 17/06/16 11:27, Mark Rutland wrote: > On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote: >> kernel/smp.c has a fancy counter that keeps track of the number of CPUs >> it marked as not-present and left in cpu_park_loop(). If there are any >> CPUs spinning in here, features like kexec or hibernate may release them >> by overwriting this memory. >> >> This problem also occurs on machines using spin-tables to release >> secondary cores. >> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N") >> we bring all known cpus into the secondary holding pen, but may not bring >> them up depending on 'maxcpus'. This memory can't be re-used by kexec >> or hibernate. >> >> Add a function cpus_are_stuck_in_kernel() to determine if either of these >> cases have occurred. >> >> Signed-off-by: James Morse <james.morse@arm.com> >> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> >> --- >> arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++ >> arch/arm64/kernel/smp.c | 13 +++++++++++++ >> 2 files changed, 33 insertions(+) >> >> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h >> index 433e50405274..4be755bcc07a 100644 >> --- a/arch/arm64/include/asm/smp.h >> +++ b/arch/arm64/include/asm/smp.h >> @@ -124,6 +124,26 @@ static inline void cpu_panic_kernel(void) >> cpu_park_loop(); >> } >> >> +/* >> + * Kernel features such as hibernate and kexec depend on cpu hotplug to know >> + * they can replace any kernel memory they are not using themselves. >> + * >> + * There are two corner cases: >> + * If a secondary CPU fails to come online, (e.g. due to mismatched features), >> + * it will try to call cpu_die(). If this fails, it increases the counter >> + * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing >> + * this function must not be re-used for anything else as the 'stuck' core >> + * is executing it. > > It might also be stuck in __no_granule_support, if it never made it to C > code. In that case, the CPU in charge of bringing up that new CPU will > increment the counter in __cpu_up. Just to clarify, *in all the cases*, the CPU in charge of bringing up updates the cpus_stuck_in_kernel. > > There might be other reasons we do something like that in future, so it > might be better to be a little less specific and say something like: > > If a secondary CPU enters the kernel but fails to come online, > (e.g. due to mismatched features), and cannot exit the kernel, > we increment cpus_stuck_in_kernel and leave the CPU in a > quiesecent loop within the kernel text. The memory containing > this loop must not be re-used for anything else as the 'stuck' > core is executing it. Agree. >> +bool cpus_are_stuck_in_kernel(void); >> + >> #endif /* ifndef __ASSEMBLY__ */ >> >> #endif /* ifndef __ASM_SMP_H */ >> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c >> index 678e0842cb3b..e197502f94fd 100644 >> --- a/arch/arm64/kernel/smp.c >> +++ b/arch/arm64/kernel/smp.c >> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier) >> { >> return -EINVAL; >> } >> + >> +bool cpus_are_stuck_in_kernel(void) >> +{ >> + bool ret = !!cpus_stuck_in_kernel; >> +#ifdef CONFIG_HOTPLUG_CPU >> + int any_cpu = raw_smp_processor_id(); >> + >> + if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) >> + ret = true; >> +#endif Minor nit: Moving the cpu_die check to a static inline function with an obvious name might make the code look better. return !!cpus_stuck_in_kernel || !have_cpu_die() ? Eitherway, Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Cheers Suzuki >> + >> + return ret; >> +} >> -- >> 2.8.0.rc3 >> >
On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote: > +bool cpus_are_stuck_in_kernel(void) > +{ > +> > bool ret = !!cpus_stuck_in_kernel; > +#ifdef CONFIG_HOTPLUG_CPU How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here? > + int any_cpu = raw_smp_processor_id(); > + > +> > if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) > +> > > ret = true; > +#endif -Geoff
On 17/06/16 17:37, Geoff Levand wrote: > On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote: >> +bool cpus_are_stuck_in_kernel(void) >> +{ >> +> > bool ret = !!cpus_stuck_in_kernel; >> +#ifdef CONFIG_HOTPLUG_CPU > > How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here? > >> + int any_cpu = raw_smp_processor_id(); >> + >> +> > if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) >> +> > > ret = true; >> +#endif > That won't work, as the cpu_ops->cpu_die is defined only if CONFIG_HOTPLUG_CPU. May be we should fix that. Suzuki
On Fri, Jun 17, 2016 at 09:37:12AM -0700, Geoff Levand wrote: > On Fri, 2016-06-17 at 10:34 +0100, James Morse wrote: > > +bool cpus_are_stuck_in_kernel(void) > > +{ > > +> > bool ret = !!cpus_stuck_in_kernel; > > +#ifdef CONFIG_HOTPLUG_CPU > > How about using 'if (IS_ENABLED(CONFIG_HOTPLUG_CPU))' here? I wanted to suggest that too, but ... > > + int any_cpu = raw_smp_processor_id(); > > + > > +> > if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) > > +> > > ret = true; ... here we'd blow up as cpu_ops::cpu_die is only defined when CONFIG_HOTPLUG_CPU is selected. So we'll have to use an ifdef, but we could instead ifdef a whole stub function (e.g. have_cpu_die()), as Suzuki suggested. Thanks, Mark.
Hi Suzuki, On 17/06/16 12:16, Suzuki K Poulose wrote: > On 17/06/16 11:27, Mark Rutland wrote: >> On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote: >>> kernel/smp.c has a fancy counter that keeps track of the number of CPUs >>> it marked as not-present and left in cpu_park_loop(). If there are any >>> CPUs spinning in here, features like kexec or hibernate may release them >>> by overwriting this memory. >>> >>> This problem also occurs on machines using spin-tables to release >>> secondary cores. >>> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N") >>> we bring all known cpus into the secondary holding pen, but may not bring >>> them up depending on 'maxcpus'. This memory can't be re-used by kexec >>> or hibernate. >>> >>> Add a function cpus_are_stuck_in_kernel() to determine if either of these >>> cases have occurred. >> It might also be stuck in __no_granule_support, if it never made it to C >> code. In that case, the CPU in charge of bringing up that new CPU will >> increment the counter in __cpu_up. > > Just to clarify, *in all the cases*, the CPU in charge of bringing up updates > the cpus_stuck_in_kernel. Ah, my mistake. I will switch it for Mark's suggestion. >>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c >>> index 678e0842cb3b..e197502f94fd 100644 >>> --- a/arch/arm64/kernel/smp.c >>> +++ b/arch/arm64/kernel/smp.c >>> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier) >>> { >>> return -EINVAL; >>> } >>> + >>> +bool cpus_are_stuck_in_kernel(void) >>> +{ >>> + bool ret = !!cpus_stuck_in_kernel; >>> +#ifdef CONFIG_HOTPLUG_CPU >>> + int any_cpu = raw_smp_processor_id(); >>> + >>> + if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) >>> + ret = true; >>> +#endif > > Minor nit: Moving the cpu_die check to a static inline function with > an obvious name might make the code look better. > > return !!cpus_stuck_in_kernel || !have_cpu_die() ? > That would be better! > Eitherway, > > Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Thanks, James
HI James, On Fri, Jun 17, 2016 at 05:48:35PM +0100, James Morse wrote: > On 17/06/16 12:16, Suzuki K Poulose wrote: > > On 17/06/16 11:27, Mark Rutland wrote: > >> On Fri, Jun 17, 2016 at 10:34:56AM +0100, James Morse wrote: > >>> kernel/smp.c has a fancy counter that keeps track of the number of CPUs > >>> it marked as not-present and left in cpu_park_loop(). If there are any > >>> CPUs spinning in here, features like kexec or hibernate may release them > >>> by overwriting this memory. > >>> > >>> This problem also occurs on machines using spin-tables to release > >>> secondary cores. > >>> After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N") > >>> we bring all known cpus into the secondary holding pen, but may not bring > >>> them up depending on 'maxcpus'. This memory can't be re-used by kexec > >>> or hibernate. > >>> > >>> Add a function cpus_are_stuck_in_kernel() to determine if either of these > >>> cases have occurred. > > >> It might also be stuck in __no_granule_support, if it never made it to C > >> code. In that case, the CPU in charge of bringing up that new CPU will > >> increment the counter in __cpu_up. > > > > Just to clarify, *in all the cases*, the CPU in charge of bringing up updates > > the cpus_stuck_in_kernel. > > Ah, my mistake. I will switch it for Mark's suggestion. > > >>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > >>> index 678e0842cb3b..e197502f94fd 100644 > >>> --- a/arch/arm64/kernel/smp.c > >>> +++ b/arch/arm64/kernel/smp.c > >>> @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier) > >>> { > >>> return -EINVAL; > >>> } > >>> + > >>> +bool cpus_are_stuck_in_kernel(void) > >>> +{ > >>> + bool ret = !!cpus_stuck_in_kernel; > >>> +#ifdef CONFIG_HOTPLUG_CPU > >>> + int any_cpu = raw_smp_processor_id(); > >>> + > >>> + if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) > >>> + ret = true; > >>> +#endif > > > > Minor nit: Moving the cpu_die check to a static inline function with > > an obvious name might make the code look better. > > > > return !!cpus_stuck_in_kernel || !have_cpu_die() ? > > > > That would be better! Can you post a new version of this, please? Will
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h index 433e50405274..4be755bcc07a 100644 --- a/arch/arm64/include/asm/smp.h +++ b/arch/arm64/include/asm/smp.h @@ -124,6 +124,26 @@ static inline void cpu_panic_kernel(void) cpu_park_loop(); } +/* + * Kernel features such as hibernate and kexec depend on cpu hotplug to know + * they can replace any kernel memory they are not using themselves. + * + * There are two corner cases: + * If a secondary CPU fails to come online, (e.g. due to mismatched features), + * it will try to call cpu_die(). If this fails, it increases the counter + * cpus_stuck_in_kernel and sits in cpu_park_loop(). The memory containing + * this function must not be re-used for anything else as the 'stuck' core + * is executing it. + * + * CPUs are also considered stuck in the kernel if we have multiple CPUs + * and no way to offline secondary CPUs. This happens when secondaries + * are released via spin-table, these CPUs are moved into the kernel's + * secondary_holding_pen, which must not be overwritten. + * + * This function is used to inhibit features like kexec and hibernate. + */ +bool cpus_are_stuck_in_kernel(void); + #endif /* ifndef __ASSEMBLY__ */ #endif /* ifndef __ASM_SMP_H */ diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 678e0842cb3b..e197502f94fd 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -909,3 +909,16 @@ int setup_profiling_timer(unsigned int multiplier) { return -EINVAL; } + +bool cpus_are_stuck_in_kernel(void) +{ + bool ret = !!cpus_stuck_in_kernel; +#ifdef CONFIG_HOTPLUG_CPU + int any_cpu = raw_smp_processor_id(); + + if (num_possible_cpus() > 1 && !cpu_ops[any_cpu]->cpu_die) + ret = true; +#endif + + return ret; +}
kernel/smp.c has a fancy counter that keeps track of the number of CPUs it marked as not-present and left in cpu_park_loop(). If there are any CPUs spinning in here, features like kexec or hibernate may release them by overwriting this memory. This problem also occurs on machines using spin-tables to release secondary cores. After commit 44dbcc93ab67 ("arm64: Fix behavior of maxcpus=N") we bring all known cpus into the secondary holding pen, but may not bring them up depending on 'maxcpus'. This memory can't be re-used by kexec or hibernate. Add a function cpus_are_stuck_in_kernel() to determine if either of these cases have occurred. Signed-off-by: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> --- arch/arm64/include/asm/smp.h | 20 ++++++++++++++++++++ arch/arm64/kernel/smp.c | 13 +++++++++++++ 2 files changed, 33 insertions(+)