
[v17,6/8] x86/smpboot: Send INIT/SIPI/SIPI to secondary CPUs in parallel

Message ID 20230328195758.1049469-7-usama.arif@bytedance.com (mailing list archive)
State New, archived
Series Parallel CPU bringup for x86_64

Commit Message

Usama Arif March 28, 2023, 7:57 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

When the APs can find their own APIC ID without assistance, perform the
AP bringup in parallel.

Register a CPUHP_BP_PARALLEL_STARTUP stage "x86/cpu:kick" which just
calls do_boot_cpu() to deliver INIT/SIPI/SIPI to each AP in turn before
the normal native_cpu_up() does the rest of the hand-holding.

The APs will then take turns through the real mode code (which has its
own bitlock for exclusion) until they make it to their own stack, then
proceed through the first few lines of start_secondary() and execute
these parts in parallel:

 start_secondary()
    -> cr4_init()
    -> (some 32-bit only stuff so not in the parallel cases)
    -> cpu_init_secondary()
       -> cpu_init_exception_handling()
       -> cpu_init()
          -> wait_for_master_cpu()

At this point they wait for the BSP to set their bit in cpu_callout_mask
(from do_wait_cpu_initialized()), and release them to continue through
the rest of cpu_init() and beyond.
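
For reference, the AP side of that handshake looks roughly like the
sketch below (paraphrased from wait_for_master_cpu() in
arch/x86/kernel/cpu/common.c as it stood around this time; shown only
for context, it is not part of this patch):

static void wait_for_master_cpu(int cpu)
{
#ifdef CONFIG_SMP
	/* Announce this CPU so the BSP's do_wait_cpu_initialized() sees it. */
	WARN_ON(cpumask_test_and_set_cpu(cpu, cpu_initialized_mask));

	/* Spin until the BSP calls this CPU out, then carry on with cpu_init(). */
	while (!cpumask_test_cpu(cpu, cpu_callout_mask))
		cpu_relax();
#endif
}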

This reduces the time taken for bringup on my 28-thread Haswell system
from about 120ms to 80ms. On a 96-thread Skylake system it takes the
bringup time from 500ms to 100ms.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---
 arch/x86/kernel/smpboot.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Comments

Thomas Gleixner March 30, 2023, 4:46 p.m. UTC | #1
On Tue, Mar 28 2023 at 20:57, Usama Arif wrote:
> The APs will then take turns through the real mode code (which has its
> own bitlock for exclusion) until they make it to their own stack, then
> proceed through the first few lines of start_secondary() and execute
> these parts in parallel:
>
>  start_secondary()
>     -> cr4_init()
>     -> (some 32-bit only stuff so not in the parallel cases)
>     -> cpu_init_secondary()
>        -> cpu_init_exception_handling()
>        -> cpu_init()
>           -> wait_for_master_cpu()
>
> At this point they wait for the BSP to set their bit in cpu_callout_mask
> (from do_wait_cpu_initialized()), and release them to continue through
> the rest of cpu_init() and beyond.

That's actually broken on SMT enabled machines when microcode needs to
be updated.

Lets look at a 2 core, 4 thread system, where CPU0/2 and CPU1/3 are the
sibling pairs.

CPU 0                           CPU1            CPU2            CPU3

for_each_present_cpu(cpu)
    cpu_up(cpu, KICK_AP_ALIVE);
                                startup()
                                wait()

                                                startup()
                                                wait()

Release CPU1
                                load_ucode()                    startup()
                                                                wait()

So that violates the rules of microcode loading that the sibling must be
in a state where it does not execute anything which might be affected by
the microcode update. The fragile startup code does not really qualify
as such a state :)

Thanks,

        tglx
Borislav Petkov March 30, 2023, 5:05 p.m. UTC | #2
On March 30, 2023 6:46:24 PM GMT+02:00, Thomas Gleixner <tglx@linutronix.de> wrote:
>So that violates the rules of microcode loading that the sibling must be
>in a state where it does not execute anything which might be affected by
>the microcode update. The fragile startup code does not really qualify
>as such a state :)

Yeah I don't think we ever enforced this for early loading. The thread sync thing came with the late loading dance....
Thomas Gleixner March 30, 2023, 6:17 p.m. UTC | #3
On Thu, Mar 30 2023 at 19:05, Borislav Petkov wrote:

> On March 30, 2023 6:46:24 PM GMT+02:00, Thomas Gleixner <tglx@linutronix.de> wrote:
>>So that violates the rules of microcode loading that the sibling must be
>>in a state where it does not execute anything which might be affected by
>>the microcode update. The fragile startup code does not really qualify
>>as such a state :)
>
> Yeah I don't think we ever enforced this for early loading.

We don't have to so far. CPU bringup is fully serialized so when the
first sibling comes up the other one is still in wait for SIPI lala
land. When the second comes up it will see that the microcode is already
up to date.
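
To make that reliance explicit: the early-load path effectively becomes
a no-op on the second sibling because its revision check sees the update
already applied. A minimal illustrative sketch (the helper names and
types here are made up, they are not the real microcode loader API):

static void early_load_ucode_sketch(const struct ucode_patch *patch)
{
	unsigned int cur_rev = read_ucode_revision();	/* e.g. IA32_UCODE_REV */

	/* Second sibling: the shared core already runs the new revision. */
	if (cur_rev >= patch->rev)
		return;

	/* Only the first thread of a core ever reaches this point. */
	apply_ucode(patch);
}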

Thanks,

        tglx
Usama Arif March 31, 2023, 11:40 a.m. UTC | #4
On 30/03/2023 19:17, Thomas Gleixner wrote:
> On Thu, Mar 30 2023 at 19:05, Borislav Petkov wrote:
> 
>> On March 30, 2023 6:46:24 PM GMT+02:00, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> So that violates the rules of microcode loading that the sibling must be
>>> in a state where it does not execute anything which might be affected by
>>> the microcode update. The fragile startup code does not really qualify
>>> as such a state :)
>>
>> Yeah I don't think we ever enforced this for early loading.
> 
> We don't have to so far. CPU bringup is fully serialized so when the
> first sibling comes up the other one is still in wait for SIPI lala
> land. When the second comes up it will see that the microcode is already
> up to date.
> 

A simple solution is to serialize load_ucode_ap() by acquiring a spinlock
at the start of ucode_cpu_init() and releasing it at its end.

I guess if we had topology_sibling_cpumask initialized at this point we
could have a spinlock per core (not thread) and parallelize it, but
that's set much later in smp_callin (a rough sketch of that variant
follows the diff below).

I can include the below in the next version if it makes sense?

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 80a688295ffa..b5e64628a975 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2155,10 +2155,13 @@ static inline void setup_getcpu(int cpu)
 }
 
 #ifdef CONFIG_X86_64
+static DEFINE_SPINLOCK(ucode_cpu_spinlock);
 static inline void ucode_cpu_init(int cpu)
 {
+	spin_lock(&ucode_cpu_spinlock);
 	if (cpu)
 		load_ucode_ap();
+	spin_unlock(&ucode_cpu_spinlock);
 }
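
The per-core variant mentioned above could look roughly like the sketch
below if sibling topology were already available at this point
(hypothetical only; ucode_core_lock and its boot-time initialization are
made up for illustration, and since topology_sibling_cpumask is not
populated until smp_callin() this is not actually doable here, hence the
global lock above):

static spinlock_t ucode_core_lock[NR_CPUS];	/* each entry spin_lock_init()ed during early boot (not shown) */

static inline void ucode_cpu_init(int cpu)
{
	/* Serialize only the SMT siblings of one core, not all APs. */
	unsigned int leader = cpumask_first(topology_sibling_cpumask(cpu));

	spin_lock(&ucode_core_lock[leader]);
	if (cpu)
		load_ucode_ap();
	spin_unlock(&ucode_core_lock[leader]);
}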

Patch

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 45f3d08321fe..0003f5e1740c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -57,6 +57,7 @@ 
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
 #include <linux/stackprotector.h>
+#include <linux/smpboot.h>
 
 #include <asm/acpi.h>
 #include <asm/cacheinfo.h>
@@ -993,7 +994,8 @@  static void announce_cpu(int cpu, int apicid)
 		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
 
 	if (cpu == 1)
-		printk(KERN_INFO "x86: Booting SMP configuration:\n");
+		printk(KERN_INFO "x86: Booting SMP configuration in %s:\n",
+		       do_parallel_bringup ? "parallel" : "series");
 
 	if (system_state < SYSTEM_RUNNING) {
 		if (node != current_node) {
@@ -1326,9 +1328,12 @@  int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 {
 	int ret;
 
-	ret = do_cpu_up(cpu, tidle);
-	if (ret)
-		goto out;
+	/* If parallel AP bringup isn't enabled, perform the first steps now. */
+	if (!do_parallel_bringup) {
+		ret = do_cpu_up(cpu, tidle);
+		if (ret)
+			goto out;
+	}
 
 	ret = do_wait_cpu_initialized(cpu);
 	if (ret)
@@ -1348,6 +1353,12 @@  int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	return ret;
 }
 
+/* Bringup step one: Send INIT/SIPI to the target AP */
+static int native_cpu_kick(unsigned int cpu)
+{
+	return do_cpu_up(cpu, idle_thread_get(cpu));
+}
+
 /**
  * arch_disable_smp_support() - disables SMP support for x86 at runtime
  */
@@ -1516,6 +1527,8 @@  static bool prepare_parallel_bringup(void)
 		smpboot_control = STARTUP_APICID_CPUID_01;
 	}
 
+	cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_STARTUP, "x86/cpu:kick",
+				  native_cpu_kick, NULL);
 	return true;
 }
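
For context on how the "x86/cpu:kick" state registered here gets
exercised: the hotplug core change earlier in this series walks the
present CPUs twice, first bringing each one only up to
CPUHP_BP_PARALLEL_STARTUP (which invokes native_cpu_kick() and thus
do_cpu_up()), then completing bringup serially, where native_cpu_up()
skips straight to do_wait_cpu_initialized(). Roughly (a paraphrased
sketch, not the exact code from the earlier patch):

void bringup_nonboot_cpus(unsigned int setup_max_cpus)
{
	unsigned int cpu;

	/* Pass 1: send INIT/SIPI/SIPI to every AP back to back. */
	for_each_present_cpu(cpu) {
		if (num_online_cpus() >= setup_max_cpus)
			break;
		if (!cpu_online(cpu))
			cpu_up(cpu, CPUHP_BP_PARALLEL_STARTUP);
	}

	/* Pass 2: complete bringup of each AP serially. */
	for_each_present_cpu(cpu) {
		if (num_online_cpus() >= setup_max_cpus)
			break;
		if (!cpu_online(cpu))
			cpu_up(cpu, CPUHP_ONLINE);
	}
}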