diff mbox series

[v2,3/7] cpu/hotplug: Add dynamic parallel bringup states before CPUHP_BRINGUP_CPU

Message ID 20211214123250.88230-4-dwmw2@infradead.org (mailing list archive)
State New, archived
Series Parallel CPU bringup for x86_64

Commit Message

David Woodhouse Dec. 14, 2021, 12:32 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

If the platform registers these states, bring all CPUs to each registered
state in turn, before the final bringup to CPUHP_BRINGUP_CPU. This allows
the architecture to parallelise the slow asynchronous tasks like sending
INIT/SIPI and waiting for the AP to come to life.

There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
this means that *all* CPUs are brought through the prepare states and to
CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
and then are allowed to run for themselves to CPUHP_ONLINE.

So any combination of prepare/start calls which depend on A-B ordering
for each CPU in turn, such as the X2APIC code which used to allocate a
cluster mask 'just in case' and store it in a global variable in the
prep stage, then potentially consume that preallocated structure from
the AP and set the global pointer to NULL to be reallocated in
CPUHP_X2APIC_PREPARE for the next CPU... would explode horribly.

We believe that X2APIC was the only such case, for x86. But this is why
it remains an architecture opt-in. For now.

Note that the new parallel stages do *not* yet bring each AP to the
CPUHP_BRINGUP_CPU state. The final loop in bringup_nonboot_cpus() is
untouched, bringing each AP in turn from the final PARALLEL_DYN state
(or all the way from CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then
waiting for that AP to do its own processing and reach CPUHP_ONLINE
before releasing the next. Parallelising that part by bringing them all
to CPUHP_BRINGUP_CPU and then waiting for them all is an exercise for
the future.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/cpuhotplug.h |  2 ++
 kernel/cpu.c               | 27 +++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)
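
The A-B ordering hazard described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `spare_mask`, `cpu_mask[]`, `prepare_cpu()` and `start_cpu()` are hypothetical stand-ins for the pattern, not the real x2apic code, and the model assumes every AP consumes the spare (the real code only potentially does).

```c
#include <stdlib.h>

#define NR_CPUS 4	/* CPU 0 is the boot CPU; CPUs 1..3 are APs */

/* Hypothetical stand-ins for the global 'just in case' allocation. */
int *spare_mask;
int *cpu_mask[NR_CPUS];

/* Prepare step, run on the boot CPU: make sure a spare exists. */
void prepare_cpu(int cpu)
{
	(void)cpu;
	if (!spare_mask)
		spare_mask = calloc(1, sizeof(*spare_mask));
}

/* Start step, run for the AP: consume the spare and clear the global. */
int start_cpu(int cpu)
{
	if (!spare_mask)
		return -1;	/* nothing was preallocated for this CPU */
	cpu_mask[cpu] = spare_mask;
	spare_mask = NULL;
	return 0;
}

/* Free everything so each scenario starts from a clean state. */
void reset_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		free(cpu_mask[cpu]);
		cpu_mask[cpu] = NULL;
	}
	free(spare_mask);
	spare_mask = NULL;
}

/* Old ordering: prepare A, start A, prepare B, start B, ... */
int bringup_one_at_a_time(void)
{
	int failed = 0;

	reset_model();
	for (int cpu = 1; cpu < NR_CPUS; cpu++) {
		prepare_cpu(cpu);
		failed += start_cpu(cpu) ? 1 : 0;
	}
	return failed;
}

/* New ordering: every CPU goes through prepare before any is started. */
int bringup_staged(void)
{
	int failed = 0;

	reset_model();
	for (int cpu = 1; cpu < NR_CPUS; cpu++)
		prepare_cpu(cpu);
	for (int cpu = 1; cpu < NR_CPUS; cpu++)
		failed += start_cpu(cpu) ? 1 : 0;
	return failed;
}
```

In this model `bringup_one_at_a_time()` reports no failures, while `bringup_staged()` leaves every AP after the first without a mask, because the single global spare was only allocated once.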

Comments

Mark Rutland Dec. 14, 2021, 2:24 p.m. UTC | #1
On Tue, Dec 14, 2021 at 12:32:46PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> If the platform registers these states, bring all CPUs to each registered
> state in turn, before the final bringup to CPUHP_BRINGUP_CPU. This allows
> the architecture to parallelise the slow asynchronous tasks like sending
> INIT/SIPI and waiting for the AP to come to life.
> 
> There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
> this means that *all* CPUs are brought through the prepare states and to
> CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
> and then are allowed to run for themselves to CPUHP_ONLINE.
> 
> So any combination of prepare/start calls which depend on A-B ordering
> for each CPU in turn, such as the X2APIC code which used to allocate a
> cluster mask 'just in case' and store it in a global variable in the
> prep stage, then potentially consume that preallocated structure from
> the AP and set the global pointer to NULL to be reallocated in
> CPUHP_X2APIC_PREPARE for the next CPU... would explode horribly.
> 
> We believe that X2APIC was the only such case, for x86. But this is why
> it remains an architecture opt-in. For now.

It might be worth elaborating with a non-x86 example, e.g.

|  We believe that X2APIC was the only such case, for x86. Other architectures
|  have similar requirements with global variables used during bringup (e.g.
|  `secondary_data` on arm/arm64), so architectures must opt-in for now.

... so that we have a specific example of how unconditionally enabling this for
all architectures would definitely break things today.

FWIW, that's something I would like to cleanup for arm64 for general
robustness, and if that would make it possible for us to have parallel bringup
in future that would be a nice bonus.

> Note that the new parallel stages do *not* yet bring each AP to the
> CPUHP_BRINGUP_CPU state. The final loop in bringup_nonboot_cpus() is
> untouched, bringing each AP in turn from the final PARALLEL_DYN state
> (or all the way from CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then
> waiting for that AP to do its own processing and reach CPUHP_ONLINE
> before releasing the next. Parallelising that part by bringing them all
> to CPUHP_BRINGUP_CPU and then waiting for them all is an exercise for
> the future.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/cpuhotplug.h |  2 ++
>  kernel/cpu.c               | 27 +++++++++++++++++++++++++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 773c83730906..45c327538321 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -131,6 +131,8 @@ enum cpuhp_state {
>  	CPUHP_MIPS_SOC_PREPARE,
>  	CPUHP_BP_PREPARE_DYN,
>  	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
> +	CPUHP_BP_PARALLEL_DYN,
> +	CPUHP_BP_PARALLEL_DYN_END		= CPUHP_BP_PARALLEL_DYN + 4,
>  	CPUHP_BRINGUP_CPU,
>  
>  	/*
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 192e43a87407..1a46eb57d8f7 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1462,6 +1462,24 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
>  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
>  {
>  	unsigned int cpu;
> +	int n = setup_max_cpus - num_online_cpus();
> +
> +	/* ∀ parallel pre-bringup state, bring N CPUs to it */

I see you have a fancy maths keyboard. ;)

It might be worth using a few more words here for clarity, e.g.

	/*
	 * Bring all nonboot CPUs through each pre-bringup state in turn
	 */

Thanks,
Mark.

> +	if (n > 0) {
> +		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
> +
> +		while (st <= CPUHP_BP_PARALLEL_DYN_END &&
> +		       cpuhp_hp_states[st].name) {
> +			int i = n;
> +
> +			for_each_present_cpu(cpu) {
> +				cpu_up(cpu, st);
> +				if (!--i)
> +					break;
> +			}
> +			st++;
> +		}
> +	}
>  
>  	for_each_present_cpu(cpu) {
>  		if (num_online_cpus() >= setup_max_cpus)
> @@ -1829,6 +1847,10 @@ static int cpuhp_reserve_state(enum cpuhp_state state)
>  		step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
>  		end = CPUHP_BP_PREPARE_DYN_END;
>  		break;
> +	case CPUHP_BP_PARALLEL_DYN:
> +		step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
> +		end = CPUHP_BP_PARALLEL_DYN_END;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1853,14 +1875,15 @@ static int cpuhp_store_callbacks(enum cpuhp_state state, const char *name,
>  	/*
>  	 * If name is NULL, then the state gets removed.
>  	 *
> -	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on
> +	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_P*_DYN are handed out on
>  	 * the first allocation from these dynamic ranges, so the removal
>  	 * would trigger a new allocation and clear the wrong (already
>  	 * empty) state, leaving the callbacks of the to be cleared state
>  	 * dangling, which causes wreckage on the next hotplug operation.
>  	 */
>  	if (name && (state == CPUHP_AP_ONLINE_DYN ||
> -		     state == CPUHP_BP_PREPARE_DYN)) {
> +		     state == CPUHP_BP_PREPARE_DYN ||
> +		     state == CPUHP_BP_PARALLEL_DYN)) {
>  		ret = cpuhp_reserve_state(state);
>  		if (ret < 0)
>  			return ret;
> -- 
> 2.31.1
>

David Woodhouse Dec. 14, 2021, 8:32 p.m. UTC | #2
On Tue, 2021-12-14 at 14:24 +0000, Mark Rutland wrote:
> On Tue, Dec 14, 2021 at 12:32:46PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > If the platform registers these states, bring all CPUs to each registered
> > state in turn, before the final bringup to CPUHP_BRINGUP_CPU. This allows
> > the architecture to parallelise the slow asynchronous tasks like sending
> > INIT/SIPI and waiting for the AP to come to life.
> > 
> > There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
> > this means that *all* CPUs are brought through the prepare states and to
> > CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
> > and then are allowed to run for themselves to CPUHP_ONLINE.
> > 
> > So any combination of prepare/start calls which depend on A-B ordering
> > for each CPU in turn, such as the X2APIC code which used to allocate a
> > cluster mask 'just in case' and store it in a global variable in the
> > prep stage, then potentially consume that preallocated structure from
> > the AP and set the global pointer to NULL to be reallocated in
> > CPUHP_X2APIC_PREPARE for the next CPU... would explode horribly.
> > 
> > We believe that X2APIC was the only such case, for x86. But this is why
> > it remains an architecture opt-in. For now.
> 
> It might be worth elaborating with a non-x86 example, e.g.
> 
> |  We believe that X2APIC was the only such case, for x86. Other architectures
> |  have similar requirements with global variables used during bringup (e.g.
> |  `secondary_data` on arm/arm64), so architectures must opt-in for now.
> 
> ... so that we have a specific example of how unconditionally enabling this for
> all architectures would definitely break things today.

I do not have such an example, and I do not know that it would
definitely break things to turn it on for all architectures today.

The x2apic one is an example of why it *might* break random
architectures and thus why it needs to be an architecture opt-in.

> FWIW, that's something I would like to cleanup for arm64 for general
> robustness, and if that would make it possible for us to have parallel bringup
> in future that would be a nice bonus.

Yes. But although I lay the groundwork here, the arch can't *actually*
do parallel bringup without some arch-specific work, so auditing the
pre-bringup states is the easy part. :)

Mark Rutland Dec. 15, 2021, 11:10 a.m. UTC | #3
On Tue, Dec 14, 2021 at 08:32:29PM +0000, David Woodhouse wrote:
> On Tue, 2021-12-14 at 14:24 +0000, Mark Rutland wrote:
> > On Tue, Dec 14, 2021 at 12:32:46PM +0000, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > 
> > > If the platform registers these states, bring all CPUs to each registered
> > > state in turn, before the final bringup to CPUHP_BRINGUP_CPU. This allows
> > > the architecture to parallelise the slow asynchronous tasks like sending
> > > INIT/SIPI and waiting for the AP to come to life.
> > > 
> > > There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
> > > this means that *all* CPUs are brought through the prepare states and to
> > > CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
> > > and then are allowed to run for themselves to CPUHP_ONLINE.
> > > 
> > > So any combination of prepare/start calls which depend on A-B ordering
> > > for each CPU in turn, such as the X2APIC code which used to allocate a
> > > cluster mask 'just in case' and store it in a global variable in the
> > > prep stage, then potentially consume that preallocated structure from
> > > the AP and set the global pointer to NULL to be reallocated in
> > > CPUHP_X2APIC_PREPARE for the next CPU... would explode horribly.
> > > 
> > > We believe that X2APIC was the only such case, for x86. But this is why
> > > it remains an architecture opt-in. For now.
> > 
> > It might be worth elaborating with a non-x86 example, e.g.
> > 
> > |  We believe that X2APIC was the only such case, for x86. Other architectures
> > |  have similar requirements with global variables used during bringup (e.g.
> > |  `secondary_data` on arm/arm64), so architectures must opt-in for now.
> > 
> > ... so that we have a specific example of how unconditionally enabling this for
> > all architectures would definitely break things today.
> 
> I do not have such an example, and I do not know that it would
> definitely break things to turn it on for all architectures today.
> 
> The x2apic one is an example of why it *might* break random
> architectures and thus why it needs to be an architecture opt-in.

Ah; I had thought we did the `secondary_data` setup in a PREPARE step, and
hence it was a comparable example, but I was mistaken. Sorry for the noise!

> > FWIW, that's something I would like to cleanup for arm64 for general
> > robustness, and if that would make it possible for us to have parallel bringup
> > in future that would be a nice bonus.
> 
> Yes. But although I lay the groundwork here, the arch can't *actually*
> do parallel bringup without some arch-specific work, so auditing the
> pre-bringup states is the easy part. :)

Sure; that was trying to be a combination of:

* This looks nice, I'd like to use this (eventually) on arm64.

* I'm aware of some arm64-specific groundwork we need to do before arm64 can
  use this.

So I think we're agreed. :)

Thanks,
Mark.

David Woodhouse Dec. 15, 2021, 3:16 p.m. UTC | #4
On Wed, 2021-12-15 at 11:10 +0000, Mark Rutland wrote:
> On Tue, Dec 14, 2021 at 08:32:29PM +0000, David Woodhouse wrote:
> > On Tue, 2021-12-14 at 14:24 +0000, Mark Rutland wrote:
> > > On Tue, Dec 14, 2021 at 12:32:46PM +0000, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > If the platform registers these states, bring all CPUs to each registered
> > > > state in turn, before the final bringup to CPUHP_BRINGUP_CPU. This allows
> > > > the architecture to parallelise the slow asynchronous tasks like sending
> > > > INIT/SIPI and waiting for the AP to come to life.
> > > > 
> > > > There is a subtlety here: even with an empty CPUHP_BP_PARALLEL_DYN step,
> > > > this means that *all* CPUs are brought through the prepare states and to
> > > > CPUHP_BP_PREPARE_DYN before any of them are taken to CPUHP_BRINGUP_CPU
> > > > and then are allowed to run for themselves to CPUHP_ONLINE.
> > > > 
> > > > So any combination of prepare/start calls which depend on A-B ordering
> > > > for each CPU in turn, such as the X2APIC code which used to allocate a
> > > > cluster mask 'just in case' and store it in a global variable in the
> > > > prep stage, then potentially consume that preallocated structure from
> > > > the AP and set the global pointer to NULL to be reallocated in
> > > > CPUHP_X2APIC_PREPARE for the next CPU... would explode horribly.
> > > > 
> > > > We believe that X2APIC was the only such case, for x86. But this is why
> > > > it remains an architecture opt-in. For now.
> > > 
> > > It might be worth elaborating with a non-x86 example, e.g.
> > > 
> > > |  We believe that X2APIC was the only such case, for x86. Other architectures
> > > |  have similar requirements with global variables used during bringup (e.g.
> > > |  `secondary_data` on arm/arm64), so architectures must opt-in for now.
> > > 
> > > ... so that we have a specific example of how unconditionally enabling this for
> > > all architectures would definitely break things today.
> > 
> > I do not have such an example, and I do not know that it would
> > definitely break things to turn it on for all architectures today.
> > 
> > The x2apic one is an example of why it *might* break random
> > architectures and thus why it needs to be an architecture opt-in.
> 
> Ah; I had thought we did the `secondary_data` setup in a PREPARE step, and
> hence it was a comparable example, but I was mistaken. Sorry for the noise!
> 

Right, that's entirely within your __cpu_up(). You can stare at
Thomas's patch for inspiration on how to cope with that one.

In arch/arm64/kernel/smp.c you have a comment saying

 * as from 2.5, kernels no longer have an init_tasks structure
 * so we need some other way of telling a new secondary core
 * where to place its SVC stack

In x86, the idle task pointer is in the per_cpu data. The real mode
bringup now starts with the CPU's APICID (which it can get from CPUID),
looks that up in the cpuid_to_apicid[] array to find the CPU#, then
finds its own per_cpu data, and gets everything else it needs
(including the initial stack) from there.
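
The lookup chain described above (APIC ID, then CPU number, then per-CPU data) might look roughly like this in a user-space model. Only the `cpuid_to_apicid[]` name comes from the text; `struct percpu_boot_data`, `percpu_boot[]`, the helper names and the example APIC IDs are assumptions made for illustration, and the real code does this in early startup assembly rather than C.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4
#define BAD_CPU (-1)

/* Hypothetical per-CPU boot data: what an AP needs before it can run C. */
struct percpu_boot_data {
	void *idle_task;
	void *initial_stack;
};

/* Indexed by CPU number, holds each CPU's APIC ID (example values). */
uint32_t cpuid_to_apicid[NR_CPUS] = { 0, 2, 4, 6 };

struct percpu_boot_data percpu_boot[NR_CPUS];

/* Reverse lookup: scan the table for the APIC ID the AP read from CPUID. */
int apicid_to_cpu(uint32_t apicid)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpuid_to_apicid[cpu] == apicid)
			return cpu;
	}
	return BAD_CPU;
}

/* Each AP finds its own data, so no shared 'next CPU' global is needed. */
struct percpu_boot_data *ap_find_boot_data(uint32_t apicid)
{
	int cpu = apicid_to_cpu(apicid);

	return cpu == BAD_CPU ? NULL : &percpu_boot[cpu];
}
```

Because each AP derives everything from its own APIC ID, several APs can run this path concurrently without the boot CPU handing them state one at a time.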

> > > FWIW, that's something I would like to cleanup for arm64 for general
> > > robustness, and if that would make it possible for us to have parallel bringup
> > > in future that would be a nice bonus.
> > 
> > Yes. But although I lay the groundwork here, the arch can't *actually*
> > do parallel bringup without some arch-specific work, so auditing the
> > pre-bringup states is the easy part. :)
> 
> Sure; that was trying to be a combination of:
> 
> * This looks nice, I'd like to use this (eventually) on arm64.
> 
> * I'm aware of some arm64-specific groundwork we need to do before arm64 can
>   use this.
> 
> So I think we're agreed. :)

I'd love to have at least one more architecture come along for the ride
as I do the next step. After this series, the largest chunk of time
seems to be spent waiting for each AP as they transition to
CPUHP_AP_ONLINE_IDLE and then all the way to CPUHP_ONLINE.

So I'm going to look at making bringup_nonboot_cpus() prod *all* the
APs to move to CPUHP_AP_ONLINE_IDLE without waiting for them to get
there. Then do another pass waiting for that and prodding them to move
to CPUHP_ONLINE. And then do a final pass of waiting for them to have
got *there*.
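
As a rough sketch of why the three-pass approach should help, the following model compares the boot CPU's total wait time for the two schemes. It is an assumption-laden toy: the function names are hypothetical, the per-AP latencies are whatever the caller supplies, and the boot CPU's own work per AP is ignored.

```c
/*
 * t_idle[i]:   time AP i needs to reach CPUHP_AP_ONLINE_IDLE once prodded
 * t_online[i]: time AP i needs to go from there to CPUHP_ONLINE
 */

/* Current scheme: each AP goes all the way online before the next starts. */
unsigned int serial_bringup_time(const unsigned int *t_idle,
				 const unsigned int *t_online, int n)
{
	unsigned int total = 0;

	for (int i = 0; i < n; i++)
		total += t_idle[i] + t_online[i];
	return total;
}

/* Proposed scheme: prod all, then wait-and-prod each, then wait for all. */
unsigned int three_pass_bringup_time(const unsigned int *t_idle,
				     const unsigned int *t_online, int n)
{
	unsigned int now = 0;	/* boot CPU's clock during pass 2 */
	unsigned int done = 0;	/* time the slowest AP reaches ONLINE */

	/* Pass 1: every AP is prodded at time 0; nothing is waited for. */

	/* Pass 2: wait for each AP in turn, then prod it towards ONLINE. */
	for (int i = 0; i < n; i++) {
		unsigned int finish;

		if (t_idle[i] > now)
			now = t_idle[i];
		finish = now + t_online[i];
		if (finish > done)
			done = finish;
	}

	/* Pass 3: the boot CPU waits until the last AP has finished. */
	return done;
}
```

With made-up latencies of {3, 1, 2} and {1, 4, 2} for three APs, the serial scheme costs 13 time units in this model and the three-pass scheme costs 7.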


> > +     int n = setup_max_cpus - num_online_cpus();
> > +
> > +     /* ∀ parallel pre-bringup state, bring N CPUs to it */
> 
> I see you have a fancy maths keyboard. ;)

Nah, standard UK layout keyboard. I just happen to remember U+2200 as
it's *right* at the beginning of the mathematical symbols block and is
fairly easy to type ;)

> It might be worth using a few more words here for clarity, e.g.
> 
>        /*
>         * Bring all nonboot CPUs through each pre-bringup state in turn
>         */

But it isn't *all* nonboot CPUs; it really is only up to N of them.

Patch

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 773c83730906..45c327538321 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -131,6 +131,8 @@  enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	CPUHP_BP_PARALLEL_DYN,
+	CPUHP_BP_PARALLEL_DYN_END		= CPUHP_BP_PARALLEL_DYN + 4,
 	CPUHP_BRINGUP_CPU,
 
 	/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 192e43a87407..1a46eb57d8f7 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1462,6 +1462,24 @@  int bringup_hibernate_cpu(unsigned int sleep_cpu)
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
 	unsigned int cpu;
+	int n = setup_max_cpus - num_online_cpus();
+
+	/* ∀ parallel pre-bringup state, bring N CPUs to it */
+	if (n > 0) {
+		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
+
+		while (st <= CPUHP_BP_PARALLEL_DYN_END &&
+		       cpuhp_hp_states[st].name) {
+			int i = n;
+
+			for_each_present_cpu(cpu) {
+				cpu_up(cpu, st);
+				if (!--i)
+					break;
+			}
+			st++;
+		}
+	}
 
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
@@ -1829,6 +1847,10 @@  static int cpuhp_reserve_state(enum cpuhp_state state)
 		step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
 		end = CPUHP_BP_PREPARE_DYN_END;
 		break;
+	case CPUHP_BP_PARALLEL_DYN:
+		step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
+		end = CPUHP_BP_PARALLEL_DYN_END;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1853,14 +1875,15 @@  static int cpuhp_store_callbacks(enum cpuhp_state state, const char *name,
 	/*
 	 * If name is NULL, then the state gets removed.
 	 *
-	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on
+	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_P*_DYN are handed out on
 	 * the first allocation from these dynamic ranges, so the removal
 	 * would trigger a new allocation and clear the wrong (already
 	 * empty) state, leaving the callbacks of the to be cleared state
 	 * dangling, which causes wreckage on the next hotplug operation.
 	 */
 	if (name && (state == CPUHP_AP_ONLINE_DYN ||
-		     state == CPUHP_BP_PREPARE_DYN)) {
+		     state == CPUHP_BP_PREPARE_DYN ||
+		     state == CPUHP_BP_PARALLEL_DYN)) {
 		ret = cpuhp_reserve_state(state);
 		if (ret < 0)
 			return ret;