diff mbox series

arm64: smp: fix smp_send_stop() behaviour

Message ID 20190613122146.45459-1-cristian.marussi@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64: smp: fix smp_send_stop() behaviour | expand

Commit Message

Cristian Marussi June 13, 2019, 12:21 p.m. UTC
On a 2-CPUs system, when one CPU is already online if the other
panics while starting-up, smp_send_stop() will fail to send any
STOP message to the other already online core, resulting in a
system still responsive and alive at the end of the panic procedure.
This patch makes smp_send_stop() account also for the online status
of the calling CPU while evaluating how many CPUs are effectively
online: this way, an adequate number of STOPs is sent, so enforcing
a proper freeze of the system at the end of panic even under the
above conditions.

Reported-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
---

This peculiar panic-procedure behaviour was exposed hitting a BUG()
while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
Such trigger-BUG() was fixed by a distinct commit already included
in Linux 5.2-rc4 [0]

[0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
---
 arch/arm64/kernel/smp.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Will Deacon June 17, 2019, 6:09 p.m. UTC | #1
[+James M]

On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
> On a 2-CPUs system, when one CPU is already online if the other
> panics while starting-up, smp_send_stop() will fail to send any
> STOP message to the other already online core, resulting in a
> system still responsive and alive at the end of the panic procedure.
> This patch makes smp_send_stop() account also for the online status
> of the calling CPU while evaluating how many CPUs are effectively
> online: this way, an adequate number of STOPs is sent, so enforcing
> a proper freeze of the system at the end of panic even under the
> above conditions.
> 
> Reported-by: Dave Martin <Dave.Martin@arm.com>
> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> ---
> 
> This peculiar panic-procedure behaviour was exposed hitting a BUG()
> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
> Such trigger-BUG() was fixed by a distinct commit already included
> in Linux 5.2-rc4 [0]
> 
> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
> ---
>  arch/arm64/kernel/smp.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index bb4b3f07761a..c7d604427883 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>  void smp_send_stop(void)
>  {
>  	unsigned long timeout;
> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>  
> -	if (num_online_cpus() > 1) {
> +	/*
> +	 * If this CPU isn't fully online, it will not be counted in
> +	 * num_online_cpus(): on a 2-CPU system this situation will
> +	 * result in no message being sent to the other already online CPU.
> +	 */
> +	if (num_online_cpus() > this_cpu_online) {
>  		cpumask_t mask;
>  
>  		cpumask_copy(&mask, cpu_online_mask);
> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>  
>  	/* Wait up to one second for other CPUs to stop */
>  	timeout = USEC_PER_SEC;
> -	while (num_online_cpus() > 1 && timeout--)
> +	while (num_online_cpus() > this_cpu_online && timeout--)
>  		udelay(1);
>  
> -	if (num_online_cpus() > 1)
> +	if (num_online_cpus() > this_cpu_online)
>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>  			   cpumask_pr_args(cpu_online_mask));

Whilst this looks ok to me, I'm worried about whether or not we have this
sort of logic elsewhere. For example, do we need to fix
crash_smp_send_stop() (and possibly machine_kexec()) too?

Will
James Morse June 18, 2019, 9:36 a.m. UTC | #2
Hi Christian, Will,

On 17/06/2019 19:09, Will Deacon wrote:
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.


>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>  	unsigned long timeout;
>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>  
>> -	if (num_online_cpus() > 1) {
>> +	/*
>> +	 * If this CPU isn't fully online, it will not be counted in
>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>> +	 * result in no message being sent to the other already online CPU.
>> +	 */
>> +	if (num_online_cpus() > this_cpu_online) {
>>  		cpumask_t mask;
>>  
>>  		cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>  
>>  	/* Wait up to one second for other CPUs to stop */
>>  	timeout = USEC_PER_SEC;
>> -	while (num_online_cpus() > 1 && timeout--)
>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>  		udelay(1);
>>  
>> -	if (num_online_cpus() > 1)
>> +	if (num_online_cpus() > this_cpu_online)
>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

We should do the same in crash_smp_send_stop(), its possible a newly-online core triggers
kdump by calling panic() in the same way.

machine_kexec() is called on the last surviving cpu after migrate_to_reboot_cpu() has run.
At first glance it looks like this could never happen there, but kexec re-enables
cpu-hotplug (commit 011e4b02f1da), and we can reschedule before we start moving memory
around, so I'm not convinced its immune...



Thanks,

James
Cristian Marussi June 18, 2019, 9:41 a.m. UTC | #3
Hi Will

Thanks for the review.

On 17/06/2019 19:09, Will Deacon wrote:
> [+James M]
> 
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.
>>
>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>> ---
>>
>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>> Such trigger-BUG() was fixed by a distinct commit already included
>> in Linux 5.2-rc4 [0]
>>
>> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>> ---
>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>  	unsigned long timeout;
>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>  
>> -	if (num_online_cpus() > 1) {
>> +	/*
>> +	 * If this CPU isn't fully online, it will not be counted in
>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>> +	 * result in no message being sent to the other already online CPU.
>> +	 */
>> +	if (num_online_cpus() > this_cpu_online) {
>>  		cpumask_t mask;
>>  
>>  		cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>  
>>  	/* Wait up to one second for other CPUs to stop */
>>  	timeout = USEC_PER_SEC;
>> -	while (num_online_cpus() > 1 && timeout--)
>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>  		udelay(1);
>>  
>> -	if (num_online_cpus() > 1)
>> +	if (num_online_cpus() > this_cpu_online)
>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

I think we certainly have such logic in crash_smp_send_stop() too at least,
maybe it is just less easily exposed given the different use case of the function.

This wanted to be just a fix only against the observed troubled panic, but I
could extend it to cover similar issues spotted by code analysis, if deemed worth.

Thanks

Cristian


> 
> Will
>
Cristian Marussi June 18, 2019, 9:56 a.m. UTC | #4
Hi Itaru

thanks for the review.

On 17/06/2019 20:58, Itaru Kitayama wrote:
> Could you avoid using the magic number like in udelay()?
> 

If you mean udelay(1) it is just that I avoided modifying anything which was not
strictly related to the fix addressed by this patch.

Thanks

Cristian

> On Thu, Jun 13, 2019 at 21:22 Cristian Marussi <cristian.marussi@arm.com>
> wrote:
> 
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.
>>
>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>> ---
>>
>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>> Such trigger-BUG() was fixed by a distinct commit already included
>> in Linux 5.2-rc4 [0]
>>
>> [0]
>> https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>> ---
>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>         unsigned long timeout;
>> +       unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>
>> -       if (num_online_cpus() > 1) {
>> +       /*
>> +        * If this CPU isn't fully online, it will not be counted in
>> +        * num_online_cpus(): on a 2-CPU system this situation will
>> +        * result in no message being sent to the other already online CPU.
>> +        */
>> +       if (num_online_cpus() > this_cpu_online) {
>>                 cpumask_t mask;
>>
>>                 cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>
>>         /* Wait up to one second for other CPUs to stop */
>>         timeout = USEC_PER_SEC;
>> -       while (num_online_cpus() > 1 && timeout--)
>> +       while (num_online_cpus() > this_cpu_online && timeout--)
>>                 udelay(1);
>>
>> -       if (num_online_cpus() > 1)
>> +       if (num_online_cpus() > this_cpu_online)
>>                 pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>                            cpumask_pr_args(cpu_online_mask));
>>
>> --
>> 2.17.1
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
>
Cristian Marussi June 18, 2019, 9:58 a.m. UTC | #5
Hi James

On 18/06/2019 10:36, James Morse wrote:
> Hi Christian, Will,
> 
> On 17/06/2019 19:09, Will Deacon wrote:
>> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>>> On a 2-CPUs system, when one CPU is already online if the other
>>> panics while starting-up, smp_send_stop() will fail to send any
>>> STOP message to the other already online core, resulting in a
>>> system still responsive and alive at the end of the panic procedure.
>>> This patch makes smp_send_stop() account also for the online status
>>> of the calling CPU while evaluating how many CPUs are effectively
>>> online: this way, an adequate number of STOPs is sent, so enforcing
>>> a proper freeze of the system at the end of panic even under the
>>> above conditions.
> 
> 
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index bb4b3f07761a..c7d604427883 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>>  void smp_send_stop(void)
>>>  {
>>>  	unsigned long timeout;
>>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>>  
>>> -	if (num_online_cpus() > 1) {
>>> +	/*
>>> +	 * If this CPU isn't fully online, it will not be counted in
>>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>>> +	 * result in no message being sent to the other already online CPU.
>>> +	 */
>>> +	if (num_online_cpus() > this_cpu_online) {
>>>  		cpumask_t mask;
>>>  
>>>  		cpumask_copy(&mask, cpu_online_mask);
>>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>>  
>>>  	/* Wait up to one second for other CPUs to stop */
>>>  	timeout = USEC_PER_SEC;
>>> -	while (num_online_cpus() > 1 && timeout--)
>>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>>  		udelay(1);
>>>  
>>> -	if (num_online_cpus() > 1)
>>> +	if (num_online_cpus() > this_cpu_online)
>>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>>  			   cpumask_pr_args(cpu_online_mask));
>>
>> Whilst this looks ok to me, I'm worried about whether or not we have this
>> sort of logic elsewhere. For example, do we need to fix
>> crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> We should do the same in crash_smp_send_stop(), its possible a newly-online core triggers
> kdump by calling panic() in the same way.
> 
> machine_kexec() is called on the last surviving cpu after migrate_to_reboot_cpu() has run.
> At first glance it looks like this could never happen there, but kexec re-enables
> cpu-hotplug (commit 011e4b02f1da), and we can reschedule before we start moving memory
> around, so I'm not convinced its immune...
> 
> 

I'll look into machine_kexec() behavior in these regards.

Thanks for the review !

Cristian
> 
> Thanks,
> 
> James
>
Will Deacon June 18, 2019, 12:46 p.m. UTC | #6
On Tue, Jun 18, 2019 at 10:41:32AM +0100, Cristian Marussi wrote:
> On 17/06/2019 19:09, Will Deacon wrote:
> >> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >> index bb4b3f07761a..c7d604427883 100644
> >> --- a/arch/arm64/kernel/smp.c
> >> +++ b/arch/arm64/kernel/smp.c
> >> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
> >>  void smp_send_stop(void)
> >>  {
> >>  	unsigned long timeout;
> >> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
> >>  
> >> -	if (num_online_cpus() > 1) {
> >> +	/*
> >> +	 * If this CPU isn't fully online, it will not be counted in
> >> +	 * num_online_cpus(): on a 2-CPU system this situation will
> >> +	 * result in no message being sent to the other already online CPU.
> >> +	 */
> >> +	if (num_online_cpus() > this_cpu_online) {
> >>  		cpumask_t mask;
> >>  
> >>  		cpumask_copy(&mask, cpu_online_mask);
> >> @@ -985,10 +991,10 @@ void smp_send_stop(void)
> >>  
> >>  	/* Wait up to one second for other CPUs to stop */
> >>  	timeout = USEC_PER_SEC;
> >> -	while (num_online_cpus() > 1 && timeout--)
> >> +	while (num_online_cpus() > this_cpu_online && timeout--)
> >>  		udelay(1);
> >>  
> >> -	if (num_online_cpus() > 1)
> >> +	if (num_online_cpus() > this_cpu_online)
> >>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
> >>  			   cpumask_pr_args(cpu_online_mask));
> > 
> > Whilst this looks ok to me, I'm worried about whether or not we have this
> > sort of logic elsewhere. For example, do we need to fix
> > crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> I think we certainly have such logic in crash_smp_send_stop() too at least,
> maybe it is just less easily exposed given the different use case of the function.
> 
> This wanted to be just a fix only against the observed troubled panic, but I
> could extend it to cover similar issues spotted by code analysis, if deemed worth.

Yes, please. Makes sense to fix these all at once.

Will
Russell King (Oracle) June 18, 2019, 12:54 p.m. UTC | #7
On Mon, Jun 17, 2019 at 07:09:13PM +0100, Will Deacon wrote:
> [+James M]
> 
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
> > On a 2-CPUs system, when one CPU is already online if the other
> > panics while starting-up, smp_send_stop() will fail to send any
> > STOP message to the other already online core, resulting in a
> > system still responsive and alive at the end of the panic procedure.
> > This patch makes smp_send_stop() account also for the online status
> > of the calling CPU while evaluating how many CPUs are effectively
> > online: this way, an adequate number of STOPs is sent, so enforcing
> > a proper freeze of the system at the end of panic even under the
> > above conditions.
> > 
> > Reported-by: Dave Martin <Dave.Martin@arm.com>
> > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > ---
> > 
> > This peculiar panic-procedure behaviour was exposed hitting a BUG()
> > while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
> > Such trigger-BUG() was fixed by a distinct commit already included
> > in Linux 5.2-rc4 [0]
> > 
> > [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
> > ---
> >  arch/arm64/kernel/smp.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index bb4b3f07761a..c7d604427883 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
> >  void smp_send_stop(void)
> >  {
> >  	unsigned long timeout;
> > +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
> >  
> > -	if (num_online_cpus() > 1) {
> > +	/*
> > +	 * If this CPU isn't fully online, it will not be counted in
> > +	 * num_online_cpus(): on a 2-CPU system this situation will
> > +	 * result in no message being sent to the other already online CPU.
> > +	 */
> > +	if (num_online_cpus() > this_cpu_online) {
> >  		cpumask_t mask;
> >  
> >  		cpumask_copy(&mask, cpu_online_mask);
> > @@ -985,10 +991,10 @@ void smp_send_stop(void)
> >  
> >  	/* Wait up to one second for other CPUs to stop */
> >  	timeout = USEC_PER_SEC;
> > -	while (num_online_cpus() > 1 && timeout--)
> > +	while (num_online_cpus() > this_cpu_online && timeout--)
> >  		udelay(1);
> >  
> > -	if (num_online_cpus() > 1)
> > +	if (num_online_cpus() > this_cpu_online)
> >  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
> >  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

What about other architectures?  This, or very similar code, is present
on other architectures too.
Cristian Marussi June 18, 2019, 5:40 p.m. UTC | #8
Hi

On 18/06/2019 13:54, Russell King - ARM Linux admin wrote:
> On Mon, Jun 17, 2019 at 07:09:13PM +0100, Will Deacon wrote:
>> [+James M]
>>
>> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>>> On a 2-CPUs system, when one CPU is already online if the other
>>> panics while starting-up, smp_send_stop() will fail to send any
>>> STOP message to the other already online core, resulting in a
>>> system still responsive and alive at the end of the panic procedure.
>>> This patch makes smp_send_stop() account also for the online status
>>> of the calling CPU while evaluating how many CPUs are effectively
>>> online: this way, an adequate number of STOPs is sent, so enforcing
>>> a proper freeze of the system at the end of panic even under the
>>> above conditions.
>>>
>>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>>> ---
>>>
>>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>>> Such trigger-BUG() was fixed by a distinct commit already included
>>> in Linux 5.2-rc4 [0]
>>>
>>> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>>> ---
>>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index bb4b3f07761a..c7d604427883 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>>  void smp_send_stop(void)
>>>  {
>>>  	unsigned long timeout;
>>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>>  
>>> -	if (num_online_cpus() > 1) {
>>> +	/*
>>> +	 * If this CPU isn't fully online, it will not be counted in
>>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>>> +	 * result in no message being sent to the other already online CPU.
>>> +	 */
>>> +	if (num_online_cpus() > this_cpu_online) {
>>>  		cpumask_t mask;
>>>  
>>>  		cpumask_copy(&mask, cpu_online_mask);
>>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>>  
>>>  	/* Wait up to one second for other CPUs to stop */
>>>  	timeout = USEC_PER_SEC;
>>> -	while (num_online_cpus() > 1 && timeout--)
>>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>>  		udelay(1);
>>>  
>>> -	if (num_online_cpus() > 1)
>>> +	if (num_online_cpus() > this_cpu_online)
>>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>>  			   cpumask_pr_args(cpu_online_mask));
>>
>> Whilst this looks ok to me, I'm worried about whether or not we have this
>> sort of logic elsewhere. For example, do we need to fix
>> crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> What about other architectures?  This, or very similar code, is present
> on other architectures too.
> 
Thanks for the review,

indeed glancing quickly at other arch's, there is a lot of common things and
subtle differences in handling these situations...potentially some common logic
to fix once for all and abstract out of arch specific code...(or at least try to
do it)....so now a question arises to me: is it still worth to properly fix on
arm64 at first extending this patch (easier and quicker maybe) or should I opt
directly for the abstraction party ?

Thanks

Cristian
diff mbox series

Patch

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index bb4b3f07761a..c7d604427883 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -971,8 +971,14 @@  void tick_broadcast(const struct cpumask *mask)
 void smp_send_stop(void)
 {
 	unsigned long timeout;
+	unsigned int this_cpu_online = cpu_online(smp_processor_id());
 
-	if (num_online_cpus() > 1) {
+	/*
+	 * If this CPU isn't fully online, it will not be counted in
+	 * num_online_cpus(): on a 2-CPU system this situation will
+	 * result in no message being sent to the other already online CPU.
+	 */
+	if (num_online_cpus() > this_cpu_online) {
 		cpumask_t mask;
 
 		cpumask_copy(&mask, cpu_online_mask);
@@ -985,10 +991,10 @@  void smp_send_stop(void)
 
 	/* Wait up to one second for other CPUs to stop */
 	timeout = USEC_PER_SEC;
-	while (num_online_cpus() > 1 && timeout--)
+	while (num_online_cpus() > this_cpu_online && timeout--)
 		udelay(1);
 
-	if (num_online_cpus() > 1)
+	if (num_online_cpus() > this_cpu_online)
 		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
 			   cpumask_pr_args(cpu_online_mask));