cpu/hotplug: Fix rollback during error-out in takedown_cpu()

Message ID 1536042803-6152-1-git-send-email-neeraju@codeaurora.org (mailing list archive)
State Not Applicable, archived
Delegated to: Andy Gross
Series: cpu/hotplug: Fix rollback during error-out in takedown_cpu()

Commit Message

Neeraj Upadhyay Sept. 4, 2018, 6:33 a.m. UTC
If takedown_cpu() fails during _cpu_down(), st->state is reset by
calling cpuhp_reset_state(). This results in an additional increment
of st->state, which causes the CPUHP_AP_SMPBOOT_THREADS state to be
skipped during rollback. Fix this by not calling cpuhp_reset_state()
and doing the state reset directly in _cpu_down().

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
---
 kernel/cpu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Mukesh Ojha Sept. 4, 2018, 12:12 p.m. UTC | #1
On 9/4/2018 12:03 PM, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
>
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> ---
>   kernel/cpu.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>   	 */
>   	ret = cpuhp_down_callbacks(cpu, st, target);
>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> -		cpuhp_reset_state(st, prev_state);
> +		/*
> +		 * As st->last is not set, cpuhp_reset_state() increments
> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> +		 * skipped during rollback. So, don't use it here.
> +		 */
> +		st->rollback = true;
> +		st->target = prev_state;
> +		st->bringup = !st->bringup;

Acked-by: Mukesh Ojha <mojha@codeaurora.org>

looks valid for this case.

>   		__cpuhp_kick_ap(st);
>   	}
>
Thomas Gleixner Sept. 5, 2018, 11:33 a.m. UTC | #2
On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
> 
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> ---
>  kernel/cpu.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>  	 */
>  	ret = cpuhp_down_callbacks(cpu, st, target);
>  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> -		cpuhp_reset_state(st, prev_state);
> +		/*
> +		 * As st->last is not set, cpuhp_reset_state() increments
> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> +		 * skipped during rollback. So, don't use it here.
> +		 */
> +		st->rollback = true;
> +		st->target = prev_state;
> +		st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.

Thanks,

	tglx
Mukesh Ojha Sept. 5, 2018, 12:09 p.m. UTC | #3
On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>> If takedown_cpu() fails during _cpu_down(), st->state is reset,
>> by calling cpuhp_reset_state(). This results in an additional
>> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
>> state being skipped during rollback. Fix this by not calling
>> cpuhp_reset_state() and doing the state reset directly in
>> _cpu_down().
>>
>> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>> ---
>>   kernel/cpu.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index aa7fe85..9f49edb 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>>   	 */
>>   	ret = cpuhp_down_callbacks(cpu, st, target);
>>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>> -		cpuhp_reset_state(st, prev_state);
>> +		/*
>> +		 * As st->last is not set, cpuhp_reset_state() increments
>> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>> +		 * skipped during rollback. So, don't use it here.
>> +		 */
>> +		st->rollback = true;
>> +		st->target = prev_state;
>> +		st->bringup = !st->bringup;
> No, this is just papering over the actual problem.
>
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
>
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

Could this be done like below ?

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..47bce90 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
         int err, cpu = smp_processor_id();
         int ret;

-       /* Ensure this CPU doesn't handle any more interrupts. */
-       err = __cpu_disable();
-       if (err < 0)
-               return err;
-
         /*
          * We get here while we are in CPUHP_TEARDOWN_CPU state and we 
must not
          * do this step again.
          */
         WARN_ON(st->state != CPUHP_TEARDOWN_CPU);
         st->state--;
+
+       /* Ensure this CPU doesn't handle any more interrupts. */
+       err = __cpu_disable();
+       if (err < 0)
+               return err;
+
         /* Invoke the former CPU_DYING callbacks */

Thanks,
Mukesh
Thomas Gleixner Sept. 5, 2018, 12:23 p.m. UTC | #4
On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> >  	ret = cpuhp_down_callbacks(cpu, st, target);
> >  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > -		cpuhp_reset_state(st, prev_state);
> > +		/*
> > +		 * As st->last is not set, cpuhp_reset_state() increments
> > +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > +		 * skipped during rollback. So, don't use it here.
> > +		 */
> > +		st->rollback = true;
> > +		st->target = prev_state;
> > +		st->bringup = !st->bringup;
> 
> No, this is just papering over the actual problem.
> 
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
> 
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universally broken when the first teardown
callback fails, because st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.

Patch below.

Thanks,

	tglx
----
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
 		ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
 		if (ret) {
 			st->target = prev_state;
-			undo_cpu_down(cpu, st);
+			if (st->state < prev_state)
+				undo_cpu_down(cpu, st);
 			break;
 		}
 	}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
 	 * to do the further cleanups.
 	 */
 	ret = cpuhp_down_callbacks(cpu, st, target);
-	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+	if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
 		cpuhp_reset_state(st, prev_state);
 		__cpuhp_kick_ap(st);
 	}
Neeraj Upadhyay Sept. 5, 2018, 12:51 p.m. UTC | #5
On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
>> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>>>   	ret = cpuhp_down_callbacks(cpu, st, target);
>>>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>>> -		cpuhp_reset_state(st, prev_state);
>>> +		/*
>>> +		 * As st->last is not set, cpuhp_reset_state() increments
>>> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>>> +		 * skipped during rollback. So, don't use it here.
>>> +		 */
>>> +		st->rollback = true;
>>> +		st->target = prev_state;
>>> +		st->bringup = !st->bringup;
>> No, this is just papering over the actual problem.
>>
>> The state inconsistency happens in take_cpu_down() when it returns with a
>> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
>> and st->state is then incremented in undo_cpu_down().
>>
>> That's the real issue and we need to analyze the whole cpu_down rollback
>> logic first.
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
>
> Patch below.
>
> Thanks,
>
> 	tglx
As per my understanding, there are 2 problems here; one is fixed with 
your patch, and the other is that cpuhp_reset_state() is used during 
rollback from a non-AP to an AP state, which seems to result in 2 
increments of st->state (one done by cpuhp_reset_state() and another 
by cpuhp_thread_fun()).
> ----
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
>   		ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
>   		if (ret) {
>   			st->target = prev_state;
> -			undo_cpu_down(cpu, st);
> +			if (st->state < prev_state)
> +				undo_cpu_down(cpu, st);
>   			break;
>   		}
>   	}
> @@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
>   	 * to do the further cleanups.
>   	 */
>   	ret = cpuhp_down_callbacks(cpu, st, target);
> -	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> +	if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>   		cpuhp_reset_state(st, prev_state);
>   		__cpuhp_kick_ap(st);
>   	}
Thomas Gleixner Sept. 5, 2018, 1:14 p.m. UTC | #6
On Wed, 5 Sep 2018, Mukesh Ojha wrote:
> On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> > > +		st->rollback = true;
> > > +		st->target = prev_state;
> > > +		st->bringup = !st->bringup;
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> Could this be done like below ?

IOW, more papering over the real root cause.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..47bce90 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
>         int err, cpu = smp_processor_id();
>         int ret;

  ^^^^^^^

Please fix your mailer or your editor. That patch is whitespace damaged.

Thanks,

	tglx
Thomas Gleixner Sept. 5, 2018, 1:17 p.m. UTC | #7
On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
>
> As per my understanding, there are 2 problems here; one is fixed with your
> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
> AP state, which seem to result in 2 increments of st->state (one increment
> done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

	tglx
Sudeep Holla Sept. 5, 2018, 2:53 p.m. UTC | #8
On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > >  	ret = cpuhp_down_callbacks(cpu, st, target);
> > >  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > -		cpuhp_reset_state(st, prev_state);
> > > +		/*
> > > +		 * As st->last is not set, cpuhp_reset_state() increments
> > > +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > +		 * skipped during rollback. So, don't use it here.
> > > +		 */
> > > +		st->rollback = true;
> > > +		st->target = prev_state;
> > > +		st->bringup = !st->bringup;
> > 
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
> 
> Patch below.

This patch fixes the issue reported at [1]. Lorenzo did some debugging
and I wanted to have a look at it at some point, but this discussion
drew my attention and sounded very similar [2]. So I did a quick test
with this patch and it fixes the issue.

--
Regards,
Sudeep

[1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/
Geert Uytterhoeven Sept. 5, 2018, 3:19 p.m. UTC | #9
Hi Sudeep,

Thanks for the CC!

On Wed, Sep 5, 2018 at 4:54 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > > >   ret = cpuhp_down_callbacks(cpu, st, target);
> > > >   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > > -         cpuhp_reset_state(st, prev_state);
> > > > +         /*
> > > > +          * As st->last is not set, cpuhp_reset_state() increments
> > > > +          * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > > +          * skipped during rollback. So, don't use it here.
> > > > +          */
> > > > +         st->rollback = true;
> > > > +         st->target = prev_state;
> > > > +         st->bringup = !st->bringup;
> > >
> > > No, this is just papering over the actual problem.
> > >
> > > The state inconsistency happens in take_cpu_down() when it returns with a
> > > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > > and st->state is then incremented in undo_cpu_down().
> > >
> > > That's the real issue and we need to analyze the whole cpu_down rollback
> > > logic first.
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
> > Patch below.
>
> This patch fixes the issue reported @[1]. Lorenzo did some debugging and
> I wanted to have a look at it at some point but this discussion drew my
> attention and sounded very similar[2]. So I did a quick test with this
> patch and it fixes the issue.

> [1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/

Thomas' patch fixes the issue for me:
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>


Gr{oetje,eeting}s,

                        Geert
Neeraj Upadhyay Sept. 6, 2018, 2:56 a.m. UTC | #10
On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>> makes it simple to observe. It's universaly broken, when the first teardown
>>> callback fails because, st->state is only decremented _AFTER_ the callback
>>> returns success, but undo_cpu_down() increments unconditionally.
>>>
>> As per my understanding, there are 2 problems here; one is fixed with your
>> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
>> AP state, which seem to result in 2 increments of st->state (one increment
>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> And how did your hack fix that up magically? I'll have a look later today.
>
> Thanks,
>
> 	tglx

The hack fixes it by not calling cpuhp_reset_state() and doing the
rollback state reset inline in _cpu_down().
Thomas Gleixner Sept. 6, 2018, 8:18 a.m. UTC | #11
On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> > > On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> > > > And looking closer this is a general issue. Just that the TEARDOWN state
> > > > makes it simple to observe. It's universaly broken, when the first
> > > > teardown
> > > > callback fails because, st->state is only decremented _AFTER_ the
> > > > callback
> > > > returns success, but undo_cpu_down() increments unconditionally.
> > > > 
> > > As per my understanding, there are 2 problems here; one is fixed with your
> > > patch, and other is cpuhp_reset_state() is used during rollback from
> > > non-AP to
> > > AP state, which seem to result in 2 increments of st->state (one increment
> > > done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> > And how did your hack fix that up magically? I'll have a look later today.
> > 
> > Thanks,
> > 
> > 	tglx
> 
> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
> reset inline in  _cpu_down().

And how is that any different from the proper patch I sent? It ends up in
the same state. So I have a hard time understanding your blurb above where
you claim that my patch just solves one of two problems?

Thanks,

	tglx
Neeraj Upadhyay Sept. 6, 2018, 9 a.m. UTC | #12
On 09/06/2018 01:48 PM, Thomas Gleixner wrote:
> On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
>>> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>>>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>>>> makes it simple to observe. It's universaly broken, when the first
>>>>> teardown
>>>>> callback fails because, st->state is only decremented _AFTER_ the
>>>>> callback
>>>>> returns success, but undo_cpu_down() increments unconditionally.
>>>>>
>>>> As per my understanding, there are 2 problems here; one is fixed with your
>>>> patch, and other is cpuhp_reset_state() is used during rollback from
>>>> non-AP to
>>>> AP state, which seem to result in 2 increments of st->state (one increment
>>>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
>>> And how did your hack fix that up magically? I'll have a look later today.
>>>
>>> Thanks,
>>>
>>> 	tglx
>>
>> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
>> reset inline in  _cpu_down().
> 
> And how is that any different from the proper patch I sent? It ends up in
> the same state. So I have a hard time to understand your blurb above where
> you claim that my patch just solves one of two problems?
> 
> Thanks,
> 
> 	tglx
>

Yes, your patch solves the problem of the smpboot unpark being skipped 
during rollback, and with the hack we end up in the same state. The 
second thing I was referring to is the one additional state increment. 
I missed that it could be required so that we reach CPUHP_AP_ONLINE_IDLE 
before calling __cpuhp_kick_ap(). So it's not a problem.

* 	cpuhp_reset_state() does one state increment and we reach 
CPUHP_AP_ONLINE_IDLE.

	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
		cpuhp_reset_state(st, prev_state);
		__cpuhp_kick_ap(st);
	}

         CPUHP_TEARDOWN_CPU,
         CPUHP_AP_ONLINE_IDLE,
         CPUHP_AP_SMPBOOT_THREADS,


* cpuhp_thread_fun() does one more increment before invoking the state 
callback (so we skip CPUHP_AP_ONLINE_IDLE's callback) and we reach 
CPUHP_AP_SMPBOOT_THREADS:

	static void cpuhp_thread_fun(unsigned int cpu)
	<snip>
		if (bringup) {
			st->state++;
			state = st->state;
	<snip>

Patch

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..9f49edb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -970,7 +970,14 @@  static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 	 */
 	ret = cpuhp_down_callbacks(cpu, st, target);
 	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
-		cpuhp_reset_state(st, prev_state);
+		/*
+		 * As st->last is not set, cpuhp_reset_state() increments
+		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+		 * skipped during rollback. So, don't use it here.
+		 */
+		st->rollback = true;
+		st->target = prev_state;
+		st->bringup = !st->bringup;
 		__cpuhp_kick_ap(st);
 	}