diff mbox series

[V2,38/38] x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it

Message ID 20230504185938.393373946@linutronix.de (mailing list archive)
State Superseded
Headers show
Series cpu/hotplug, x86: Reworked parallel CPU bringup | expand

Checks

Context Check Description
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next at HEAD c2d3c8441e3d
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 5 and now 5
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 8 this patch: 8
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 8 this patch: 8
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 5 this patch: 5
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning WARNING: 'seperately' may be misspelled - perhaps 'separately'? WARNING: Non-standard signature: Originally-by:
conchuod/source_inline warning Was 1 now: 1
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Thomas Gleixner May 4, 2023, 7:03 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

Implement the validation function which tells the core code whether
parallel bringup is possible.

The only condition for now is that the kernel does not run in an encrypted
guest as these will trap the RDMSR via #VC, which cannot be handled at that
point in early startup.

There was an earlier variant for AMD-SEV which used the GHBC protocol for
retrieving the APIC ID via CPUID, but there is no guarantee that the
initial APIC ID in CPUID is the same as the real APIC ID. There is no
enforcement from the secure firmware and the hypervisor can assign APIC IDs
as it sees fit as long as the ACPI/MADT table is consistent with that
assignment.

Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
AMD-SEV guests for parallel startup needs some more thought.

Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
the scope of this change.

Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.

[ mikelley: Reported the announce_cpu() fallout

Originally-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Fixup announce_cpu() - Michael Kelley
---
 arch/x86/Kconfig             |    3 +
 arch/x86/kernel/cpu/common.c |    6 ---
 arch/x86/kernel/smpboot.c    |   83 ++++++++++++++++++++++++++++++++++++-------
 3 files changed, 73 insertions(+), 19 deletions(-)
---

Comments

Michael Kelley (LINUX) May 6, 2023, 12:53 a.m. UTC | #1
From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, May 4, 2023 12:03 PM
> 
> Implement the validation function which tells the core code whether
> parallel bringup is possible.
> 
> The only condition for now is that the kernel does not run in an encrypted
> guest as these will trap the RDMSR via #VC, which cannot be handled at that
> point in early startup.
> 
> There was an earlier variant for AMD-SEV which used the GHBC protocol for
> retrieving the APIC ID via CPUID, but there is no guarantee that the
> initial APIC ID in CPUID is the same as the real APIC ID. There is no
> enforcement from the secure firmware and the hypervisor can assign APIC IDs
> as it sees fit as long as the ACPI/MADT table is consistent with that
> assignment.
> 
> Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
> AMD-SEV guests for parallel startup needs some more thought.
> 
> Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
> the scope of this change.
> 
> Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
> CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.
> 
> [ mikelley: Reported the announce_cpu() fallout
> 
> Originally-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Fixup announce_cpu() - Michael Kelley
> ---
>  arch/x86/Kconfig             |    3 +
>  arch/x86/kernel/cpu/common.c |    6 ---
>  arch/x86/kernel/smpboot.c    |   83 ++++++++++++++++++++++++++++++++++++----
> ---

[snip]

> @@ -934,10 +961,10 @@ static void announce_cpu(int cpu, int ap
>  	if (!node_width)
>  		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> 
> -	if (cpu == 1)
> -		printk(KERN_INFO "x86: Booting SMP configuration:\n");
> -
>  	if (system_state < SYSTEM_RUNNING) {
> +		if (num_online_cpus() == 1)

Unfortunately, this new check doesn't work.  Here's the output I get:

[    0.721384] smp: Bringing up secondary CPUs ...
[    0.725359] smpboot: x86: Booting SMP configuration:
[    0.729249] .... node  #0, CPUs:        #2
[    0.729654] smpboot: x86: Booting SMP configuration:
[    0.737247]       #4
[    0.737511] smpboot: x86: Booting SMP configuration:
[    0.741246]       #6
[    0.741508] smpboot: x86: Booting SMP configuration:
[    0.745248]       #8
[    0.745507] smpboot: x86: Booting SMP configuration:
[    0.749250]      #10
[    0.749514] smpboot: x86: Booting SMP configuration:
[    0.753248]      #12
[    0.753492] smpboot: x86: Booting SMP configuration:
[    0.757249]      #14  #1  #3  #5  #7  #9 #11 #13 #15
[    0.769317] smp: Brought up 1 node, 16 CPUs
[    0.773246] smpboot: Max logical packages: 1
[    0.777257] smpboot: Total of 16 processors activated (78253.79 BogoMIPS)

Evidently num_online_cpus() isn't updated until after all the primary
siblings get started.

When booting with cpuhp.parallel=0, the output is good.

Michael

> +			pr_info("x86: Booting SMP configuration:\n");
> +
>  		if (node != current_node) {
>  			if (current_node > (-1))
>  				pr_cont("\n");
> @@ -948,7 +975,7 @@ static void announce_cpu(int cpu, int ap
>  		}
> 
>  		/* Add padding for the BSP */
> -		if (cpu == 1)
> +		if (num_online_cpus() == 1)
>  			pr_cont("%*s", width + 1, " ");
> 
>  		pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
Thomas Gleixner May 6, 2023, 4:22 p.m. UTC | #2
On Sat, May 06 2023 at 00:53, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, May 4, 2023 12:03 PM
> [snip]
>
>> @@ -934,10 +961,10 @@ static void announce_cpu(int cpu, int ap
>>  	if (!node_width)
>>  		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
>> 
>> -	if (cpu == 1)
>> -		printk(KERN_INFO "x86: Booting SMP configuration:\n");
>> -
>>  	if (system_state < SYSTEM_RUNNING) {
>> +		if (num_online_cpus() == 1)
>
> Unfortunately, this new check doesn't work.  Here's the output I get:
>
> [    0.721384] smp: Bringing up secondary CPUs ...
> [    0.725359] smpboot: x86: Booting SMP configuration:
> [    0.729249] .... node  #0, CPUs:        #2
> [    0.729654] smpboot: x86: Booting SMP configuration:
> [    0.737247]       #4
>
> Evidently num_online_cpus() isn't updated until after all the primary
> siblings get started.

Duh. Where is that brown paperbag?

> When booting with cpuhp.parallel=0, the output is good.

Exactly that was on the command line when I quickly booted that change :(

The below should fix it for real.

Thanks,

        tglx
---
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -951,9 +951,9 @@ static int wakeup_secondary_cpu_via_init
 /* reduce the number of lines printed when booting a large cpu count system */
 static void announce_cpu(int cpu, int apicid)
 {
+	static int width, node_width, first = 1;
 	static int current_node = NUMA_NO_NODE;
 	int node = early_cpu_to_node(cpu);
-	static int width, node_width;
 
 	if (!width)
 		width = num_digits(num_possible_cpus()) + 1; /* + '#' sign */
@@ -962,7 +962,7 @@ static void announce_cpu(int cpu, int ap
 		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
 
 	if (system_state < SYSTEM_RUNNING) {
-		if (num_online_cpus() == 1)
+		if (first)
 			pr_info("x86: Booting SMP configuration:\n");
 
 		if (node != current_node) {
@@ -975,11 +975,11 @@ static void announce_cpu(int cpu, int ap
 		}
 
 		/* Add padding for the BSP */
-		if (num_online_cpus() == 1)
+		if (first)
 			pr_cont("%*s", width + 1, " ");
+		first = 0;
 
 		pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
-
 	} else
 		pr_info("Booting Node %d Processor %d APIC 0x%x\n",
 			node, cpu, apicid);
Michael Kelley (LINUX) May 7, 2023, 4:14 a.m. UTC | #3
From: Thomas Gleixner <tglx@linutronix.de> Sent: Saturday, May 6, 2023 9:23 AM
> 
> On Sat, May 06 2023 at 00:53, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, May 4, 2023 12:03 PM
> > [snip]
> >
> >> @@ -934,10 +961,10 @@ static void announce_cpu(int cpu, int ap
> >>  	if (!node_width)
> >>  		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> >>
> >> -	if (cpu == 1)
> >> -		printk(KERN_INFO "x86: Booting SMP configuration:\n");
> >> -
> >>  	if (system_state < SYSTEM_RUNNING) {
> >> +		if (num_online_cpus() == 1)
> >
> > Unfortunately, this new check doesn't work.  Here's the output I get:
> >
> > [    0.721384] smp: Bringing up secondary CPUs ...
> > [    0.725359] smpboot: x86: Booting SMP configuration:
> > [    0.729249] .... node  #0, CPUs:        #2
> > [    0.729654] smpboot: x86: Booting SMP configuration:
> > [    0.737247]       #4
> >
> > Evidently num_online_cpus() isn't updated until after all the primary
> > siblings get started.
> 
> Duh. Where is that brown paperbag?
> 
> > When booting with cpuhp.parallel=0, the output is good.
> 
> Exactly that was on the command line when I quickly booted that change :(
> 
> The below should fix it for real.
> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -951,9 +951,9 @@ static int wakeup_secondary_cpu_via_init
>  /* reduce the number of lines printed when booting a large cpu count system */
>  static void announce_cpu(int cpu, int apicid)
>  {
> +	static int width, node_width, first = 1;
>  	static int current_node = NUMA_NO_NODE;
>  	int node = early_cpu_to_node(cpu);
> -	static int width, node_width;
> 
>  	if (!width)
>  		width = num_digits(num_possible_cpus()) + 1; /* + '#' sign */
> @@ -962,7 +962,7 @@ static void announce_cpu(int cpu, int ap
>  		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> 
>  	if (system_state < SYSTEM_RUNNING) {
> -		if (num_online_cpus() == 1)
> +		if (first)
>  			pr_info("x86: Booting SMP configuration:\n");
> 
>  		if (node != current_node) {
> @@ -975,11 +975,11 @@ static void announce_cpu(int cpu, int ap
>  		}
> 
>  		/* Add padding for the BSP */
> -		if (num_online_cpus() == 1)
> +		if (first)
>  			pr_cont("%*s", width + 1, " ");
> +		first = 0;
> 
>  		pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
> -
>  	} else
>  		pr_info("Booting Node %d Processor %d APIC 0x%x\n",
>  			node, cpu, apicid);

This works.  dmesg output is clean for these guest VM combinations
on Hyper-V that I tested:

* Normal VM:  16 vCPUs in 1 NUMA node and 32 vCPUs in 2 NUMA nodes
* Same configs for a SEV-SNP Confidential VM with paravisor

Tested with and without cpuhp.parallel=0

For the entire series:
Tested-by: Michael Kelley <mikelley@microsoft.com>
diff mbox series

Patch

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,8 +274,9 @@  config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
+	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
-	select HOTPLUG_SPLIT_STARTUP		if SMP
+	select HOTPLUG_SPLIT_STARTUP		if SMP && X86_32
 	select IRQ_FORCED_THREADING
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2128,11 +2128,7 @@  static inline void setup_getcpu(int cpu)
 }
 
 #ifdef CONFIG_X86_64
-static inline void ucode_cpu_init(int cpu)
-{
-	if (cpu)
-		load_ucode_ap();
-}
+static inline void ucode_cpu_init(int cpu) { }
 
 static inline void tss_setup_ist(struct tss_struct *tss)
 {
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -58,6 +58,7 @@ 
 #include <linux/overflow.h>
 #include <linux/stackprotector.h>
 #include <linux/cpuhotplug.h>
+#include <linux/mc146818rtc.h>
 
 #include <asm/acpi.h>
 #include <asm/cacheinfo.h>
@@ -75,7 +76,7 @@ 
 #include <asm/fpu/api.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
-#include <linux/mc146818rtc.h>
+#include <asm/microcode.h>
 #include <asm/i8259.h>
 #include <asm/misc.h>
 #include <asm/qspinlock.h>
@@ -128,7 +129,6 @@  int arch_update_cpu_topology(void)
 	return retval;
 }
 
-
 static unsigned int smpboot_warm_reset_vector_count;
 
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
@@ -229,16 +229,43 @@  static void notrace start_secondary(void
 	 */
 	cr4_init();
 
-#ifdef CONFIG_X86_32
-	/* switch away from the initial page table */
-	load_cr3(swapper_pg_dir);
-	__flush_tlb_all();
-#endif
+	/*
+	 * 32-bit specific. 64-bit reaches this code with the correct page
+	 * table established. Yet another historical divergence.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32)) {
+		/* switch away from the initial page table */
+		load_cr3(swapper_pg_dir);
+		__flush_tlb_all();
+	}
+
 	cpu_init_exception_handling();
 
 	/*
-	 * Synchronization point with the hotplug core. Sets the
-	 * synchronization state to ALIVE and waits for the control CPU to
+	 * 32-bit systems load the microcode from the ASM startup code for
+	 * historical reasons.
+	 *
+	 * On 64-bit systems load it before reaching the AP alive
+	 * synchronization point below so it is not part of the full per
+	 * CPU serialized bringup part when "parallel" bringup is enabled.
+	 *
+	 * That's even safe when hyperthreading is enabled in the CPU as
+	 * the core code starts the primary threads first and leaves the
+	 * secondary threads waiting for SIPI. Loading microcode on
+	 * physical cores concurrently is a safe operation.
+	 *
+	 * This covers both the Intel specific issue that concurrent
+	 * microcode loading on SMT siblings must be prohibited and the
+	 * vendor independent issue`that microcode loading which changes
+	 * CPUID, MSRs etc. must be strictly serialized to maintain
+	 * software state correctness.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64))
+		load_ucode_ap();
+
+	/*
+	 * Synchronization point with the hotplug core. Sets this CPUs
+	 * synchronization state to ALIVE and spin-waits for the control CPU to
 	 * release this CPU for further bringup.
 	 */
 	cpuhp_ap_sync_alive();
@@ -934,10 +961,10 @@  static void announce_cpu(int cpu, int ap
 	if (!node_width)
 		node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
 
-	if (cpu == 1)
-		printk(KERN_INFO "x86: Booting SMP configuration:\n");
-
 	if (system_state < SYSTEM_RUNNING) {
+		if (num_online_cpus() == 1)
+			pr_info("x86: Booting SMP configuration:\n");
+
 		if (node != current_node) {
 			if (current_node > (-1))
 				pr_cont("\n");
@@ -948,7 +975,7 @@  static void announce_cpu(int cpu, int ap
 		}
 
 		/* Add padding for the BSP */
-		if (cpu == 1)
+		if (num_online_cpus() == 1)
 			pr_cont("%*s", width + 1, " ");
 
 		pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
@@ -1242,6 +1269,36 @@  void __init smp_prepare_cpus_common(void
 	set_cpu_sibling_map(0);
 }
 
+#ifdef CONFIG_X86_64
+/* Establish whether parallel bringup can be supported. */
+bool __init arch_cpuhp_init_parallel_bringup(void)
+{
+	/*
+	 * Encrypted guests require special handling. They enforce X2APIC
+	 * mode but the RDMSR to read the APIC ID is intercepted and raises
+	 * #VC or #VE which cannot be handled in the early startup code.
+	 *
+	 * AMD-SEV does not provide a RDMSR GHCB protocol so the early
+	 * startup code cannot directly communicate with the secure
+	 * firmware. The alternative solution to retrieve the APIC ID via
+	 * CPUID(0xb), which is covered by the GHCB protocol, is not viable
+	 * either because there is no enforcement of the CPUID(0xb)
+	 * provided "initial" APIC ID to be the same as the real APIC ID.
+	 *
+	 * Intel-TDX has a secure RDMSR hypercall, but that needs to be
+	 * implemented seperately in the low level startup ASM code.
+	 */
+	if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
+		pr_info("Parallel CPU startup disabled due to guest state encryption\n");
+		return false;
+	}
+
+	smpboot_control = STARTUP_READ_APICID;
+	pr_debug("Parallel CPU startup enabled: 0x%08x\n", smpboot_control);
+	return true;
+}
+#endif
+
 /*
  * Prepare for SMP bootup.
  * @max_cpus: configured maximum number of CPUs, It is a legacy parameter