diff mbox series

[v6,06/11] x86/smpboot: Support parallel startup of secondary CPUs

Message ID 20230202215625.3248306-7-usama.arif@bytedance.com (mailing list archive)
State Superseded
Headers show
Series Parallel CPU bringup for x86_64 | expand

Commit Message

Usama Arif Feb. 2, 2023, 9:56 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

To allow for parallel AP bringup, we need to avoid the use of global
variables for passing information to the APs, as well as preventing them
from all trying to use the same real-mode stack simultaneously.

So, introduce a 'lock' field in struct trampoline_header to use as a
simple bit-spinlock for the real-mode stack. That lock also protects
the global variables initial_gs, initial_stack and early_gdt_descr,
which can now be calculated...

So how do we calculate those addresses? Well, they they can all be found
from the per_cpu data for this CPU. Simples! Except... how does it know
what its CPU# is? OK, we export the cpuid_to_apicid[] array and it can
search it to find its APIC ID in there.

But now you whine at me that it doesn't even know its APIC ID? Well, if
it's a relatively modern CPU then the APIC ID is in CPUID leaf 0x0B so
we can use that. Otherwise... erm... OK, otherwise it can't have parallel
CPU bringup for now. We'll still use a global variable for those CPUs and
bring them up one at a time.

So add a global 'smpboot_control' field which either contains the APIC
ID, or a flag indicating that it can be found in CPUID.

This adds the 'do_parallel_bringup' flag in preparation but doesn't
actually enable parallel bringup yet.

[ dwmw2: Minor tweaks, write a commit message ]
[ seanc: Fix stray override of initial_gs in common_cpu_up() ]
Not-signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 arch/x86/include/asm/realmode.h      |  3 ++
 arch/x86/include/asm/smp.h           |  9 +++-
 arch/x86/kernel/acpi/sleep.c         |  1 +
 arch/x86/kernel/apic/apic.c          |  2 +-
 arch/x86/kernel/head_64.S            | 75 ++++++++++++++++++++++++++++
 arch/x86/kernel/smpboot.c            | 34 +++++++++++--
 arch/x86/realmode/init.c             |  3 ++
 arch/x86/realmode/rm/trampoline_64.S | 14 ++++++
 kernel/smpboot.c                     |  2 +-
 9 files changed, 136 insertions(+), 7 deletions(-)

Comments

David Woodhouse Feb. 2, 2023, 10:30 p.m. UTC | #1
On Thu, 2023-02-02 at 21:56 +0000, Usama Arif wrote:
> @@ -1515,6 +1530,17 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>  
>         speculative_store_bypass_ht_init();
>  
> +       /*
> +        * We can do 64-bit AP bringup in parallel if the CPU reports its
> +        * APIC ID in CPUID leaf 0x0B. Otherwise it's too hard. And not
> +        * for SEV-ES guests because they can't use CPUID that early.
> +        * Also, some AMD CPUs crash when doing parallel cpu bringup, disable
> +        * it for all AMD CPUs to be on the safe side.
> +        */
> +       if (IS_ENABLED(CONFIG_X86_32) || boot_cpu_data.cpuid_level < 0x0B ||
> +           cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
> +               do_parallel_bringup = false;
> +
>         snp_set_wakeup_secondary_cpu();
>  }

Oops, I thought I'd removed those two lines about AMD in the comment
when I removed the actual code that did so. Turns out we remove those
lines again in the *next* patch instead.
Usama Arif Feb. 2, 2023, 10:50 p.m. UTC | #2
On 02/02/2023 22:30, David Woodhouse wrote:
> On Thu, 2023-02-02 at 21:56 +0000, Usama Arif wrote:
>> @@ -1515,6 +1530,17 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>>   
>>          speculative_store_bypass_ht_init();
>>   
>> +       /*
>> +        * We can do 64-bit AP bringup in parallel if the CPU reports its
>> +        * APIC ID in CPUID leaf 0x0B. Otherwise it's too hard. And not
>> +        * for SEV-ES guests because they can't use CPUID that early.
>> +        * Also, some AMD CPUs crash when doing parallel cpu bringup, disable
>> +        * it for all AMD CPUs to be on the safe side.
>> +        */
>> +       if (IS_ENABLED(CONFIG_X86_32) || boot_cpu_data.cpuid_level < 0x0B ||
>> +           cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
>> +               do_parallel_bringup = false;
>> +
>>          snp_set_wakeup_secondary_cpu();
>>   }
> 
> Oops, I thought I'd removed those two lines about AMD in the comment
> when I removed the actual code that did so. Turns out we remove those
> lines again in the *next* patch instead.

Ah, sorry about that, should have caught it while reviewing before 
posting. To think of it, it might be better to squash this and next AMD 
disabling patch, if anyone testing on AMD ever happens to check out at 
this specific patch, their machine might not boot :). Will repost only 
after getting the OK for the rest of the patches along with addressing 
any other comments that come up.

Thanks,
Usama
David Woodhouse Feb. 3, 2023, 8:14 a.m. UTC | #3
On Thu, 2023-02-02 at 22:50 +0000, Usama Arif wrote:
> 
> Ah, sorry about that, should have caught it while reviewing before 
> posting. To think of it, it might be better to squash this and next AMD 
> disabling patch, if anyone testing on AMD ever happens to check out at 
> this specific patch, their machine might not boot :). 
> 

Nah, we don't actually *enable* parallel boot until the next patch
anyway. At this point we're only laying the groundwork. I think it's
best separate.

> Will repost only after getting the OK for the rest of the patches
> along with addressing any other comments that come up.

Ack. And hopefully the missing SoB from both Thomas and Arjan for their
parts. Part of the reasoning for the lack of SoB from Arjan might have
been that we haven't finished doing the thinking about how this works
in all cases, with non-constant TSC and other stuff. Let's make sure we
give that a long hard stare.
Arjan van de Ven Feb. 3, 2023, 2:41 p.m. UTC | #4
On 2/3/2023 12:14 AM, David Woodhouse wrote:
> On Thu, 2023-02-02 at 22:50 +0000, Usama Arif wrote:
>>
>> Ah, sorry about that, should have caught it while reviewing before
>> posting. To think of it, it might be better to squash this and next AMD
>> disabling patch, if anyone testing on AMD ever happens to check out at
>> this specific patch, their machine might not boot :).
>>
> 
> Nah, we don't actually *enable* parallel boot until the next patch
> anyway. At this point we're only laying the groundwork. I think it's
> best separate.
> 
>> Will repost only after getting the OK for the rest of the patches
>> along with addressing any other comments that come up.
> 
> Ack. And hopefully the missing SoB from both Thomas and Arjan for their
> parts. Part of the reasoning for the lack of SoB from Arjan might have
> been that we haven't finished doing the thinking about how this works
> in all cases, with non-constant TSC and other stuff. Let's make sure we
> give that a long hard stare.

most of it is that I need to sit down for some real time and have a think through it ;)

for all the common cases / modern HW it will be ok... but all the rest???

maybe we need a strict(er) safety valve on where this is used and then over time
widen the usage?

(after all, a regression means revert, and maybe it's better to only revert "add CPU type X" than
revert the whole series)
Sean Christopherson Feb. 3, 2023, 6:17 p.m. UTC | #5
On Thu, Feb 02, 2023, Usama Arif wrote:
> @@ -1515,6 +1530,17 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>  
>  	speculative_store_bypass_ht_init();
>  
> +	/*
> +	 * We can do 64-bit AP bringup in parallel if the CPU reports its
> +	 * APIC ID in CPUID leaf 0x0B. Otherwise it's too hard. And not
> +	 * for SEV-ES guests because they can't use CPUID that early.
> +	 * Also, some AMD CPUs crash when doing parallel cpu bringup, disable
> +	 * it for all AMD CPUs to be on the safe side.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32) || boot_cpu_data.cpuid_level < 0x0B ||
> +	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))

Obviously not your fault, but CC_ATTR_GUEST_STATE_ENCRYPT is a terrible name.
TDX guest state is also encrypted, but TDX doesn't return true CC_ATTR_GUEST_STATE_ENCRYPT.
I.e. this looks wrong, but is correct, because CC_ATTR_GUEST_STATE_ENCRYPT really
just means "SEV-ES guest".

> +		do_parallel_bringup = false;
> +
>  	snp_set_wakeup_secondary_cpu();
>  }
Thomas Gleixner Feb. 7, 2023, 12:07 a.m. UTC | #6
On Thu, Feb 02 2023 at 21:56, Usama Arif wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> To allow for parallel AP bringup, we need to avoid the use of global
> variables for passing information to the APs, as well as preventing them
> from all trying to use the same real-mode stack simultaneously.
>
> So, introduce a 'lock' field in struct trampoline_header to use as a
> simple bit-spinlock for the real-mode stack. That lock also protects
> the global variables initial_gs, initial_stack and early_gdt_descr,
> which can now be calculated...
>
> So how do we calculate those addresses? Well, they they can all be found
> from the per_cpu data for this CPU. Simples! Except... how does it know
> what its CPU# is? OK, we export the cpuid_to_apicid[] array and it can
> search it to find its APIC ID in there.
>
> But now you whine at me that it doesn't even know its APIC ID? Well, if
> it's a relatively modern CPU then the APIC ID is in CPUID leaf 0x0B so
> we can use that. Otherwise... erm... OK, otherwise it can't have parallel
> CPU bringup for now. We'll still use a global variable for those CPUs and
> bring them up one at a time.
>
> So add a global 'smpboot_control' field which either contains the APIC
> ID, or a flag indicating that it can be found in CPUID.
>
> This adds the 'do_parallel_bringup' flag in preparation but doesn't
> actually enable parallel bringup yet.
>
> [ dwmw2: Minor tweaks, write a commit message ]
> [ seanc: Fix stray override of initial_gs in common_cpu_up() ]
> Not-signed-off-by: Thomas Gleixner <tglx@linutronix.de>

I'm happy to add my SOB if someone actually writes a coherent changelog
which complies with the documented rules :)
diff mbox series

Patch

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index a336feef0af1..f0357cfe2fb0 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -52,6 +52,7 @@  struct trampoline_header {
 	u64 efer;
 	u32 cr4;
 	u32 flags;
+	u32 lock;
 #endif
 };
 
@@ -65,6 +66,8 @@  extern unsigned long initial_stack;
 extern unsigned long initial_vc_handler;
 #endif
 
+extern u32 *trampoline_lock;
+
 extern unsigned char real_mode_blob[];
 extern unsigned char real_mode_relocs[];
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index b4dbb20dab1a..9b3c0cd35f5f 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -199,5 +199,12 @@  extern void nmi_selftest(void);
 #define nmi_selftest() do { } while (0)
 #endif
 
-#endif /* __ASSEMBLY__ */
+extern unsigned int smpboot_control;
+
+#endif /* !__ASSEMBLY__ */
+
+/* Control bits for startup_64 */
+#define	STARTUP_PARALLEL	0x80000000
+#define	STARTUP_SECONDARY	0x40000000
+
 #endif /* _ASM_X86_SMP_H */
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 3b7f4cdbf2e0..06adf340a0f1 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -115,6 +115,7 @@  int x86_acpi_suspend_lowlevel(void)
 	early_gdt_descr.address =
 			(unsigned long)get_cpu_gdt_rw(smp_processor_id());
 	initial_gs = per_cpu_offset(smp_processor_id());
+	smpboot_control = 0;
 #endif
 	initial_code = (unsigned long)wakeup_long64;
        saved_magic = 0x123456789abcdef0L;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 20d9a604da7c..ac1d7e5da1f2 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2377,7 +2377,7 @@  static int nr_logical_cpuids = 1;
 /*
  * Used to store mapping between logical CPU IDs and APIC IDs.
  */
-static int cpuid_to_apicid[] = {
+int cpuid_to_apicid[] = {
 	[0 ... NR_CPUS - 1] = -1,
 };
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 222efd4a09bc..d07f694691d2 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -25,6 +25,7 @@ 
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
 #include <asm/fixmap.h>
+#include <asm/smp.h>
 
 /*
  * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
@@ -241,6 +242,68 @@  SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	UNWIND_HINT_EMPTY
 	ANNOTATE_NOENDBR // above
 
+#ifdef CONFIG_SMP
+	/*
+	 * Is this the boot CPU coming up? If so everything is available
+	 * in initial_gs, initial_stack and early_gdt_descr.
+	 */
+	movl	smpboot_control(%rip), %eax
+	testl	%eax, %eax
+	jz	.Lsetup_cpu
+
+	/*
+	 * Secondary CPUs find out the offsets via the APIC ID. For parallel
+	 * boot the APIC ID is retrieved from CPUID, otherwise it's encoded
+	 * in smpboot_control:
+	 * Bit 0-29	APIC ID if STARTUP_PARALLEL flag is not set
+	 * Bit 30	STARTUP_SECONDARY flag
+	 * Bit 31	STARTUP_PARALLEL flag (use CPUID 0x0b for APIC ID)
+	 */
+	testl	$STARTUP_PARALLEL, %eax
+	jnz	.Luse_cpuid_0b
+	andl	$0x0FFFFFFF, %eax
+	jmp	.Lsetup_AP
+
+.Luse_cpuid_0b:
+	mov	$0x0B, %eax
+	xorl	%ecx, %ecx
+	cpuid
+	mov	%edx, %eax
+
+.Lsetup_AP:
+	/* EAX contains the APICID of the current CPU */
+	xorl	%ecx, %ecx
+	leaq	cpuid_to_apicid(%rip), %rbx
+
+.Lfind_cpunr:
+	cmpl	(%rbx), %eax
+	jz	.Linit_cpu_data
+	addq	$4, %rbx
+	addq	$8, %rcx
+	jmp	.Lfind_cpunr
+
+.Linit_cpu_data:
+	/* Get the per cpu offset */
+	leaq	__per_cpu_offset(%rip), %rbx
+	addq	%rcx, %rbx
+	movq	(%rbx), %rbx
+	/* Save it for GS BASE setup */
+	movq	%rbx, initial_gs(%rip)
+
+	/* Calculate the GDT address */
+	movq	$gdt_page, %rcx
+	addq	%rbx, %rcx
+	movq	%rcx, early_gdt_descr_base(%rip)
+
+	/* Find the idle task stack */
+	movq	$idle_threads, %rcx
+	addq	%rbx, %rcx
+	movq	(%rcx), %rcx
+	movq	TASK_threadsp(%rcx), %rcx
+	movq	%rcx, initial_stack(%rip)
+#endif /* CONFIG_SMP */
+
+.Lsetup_cpu:
 	/*
 	 * We must switch to a new descriptor in kernel space for the GDT
 	 * because soon the kernel won't have access anymore to the userspace
@@ -281,6 +344,14 @@  SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 */
 	movq initial_stack(%rip), %rsp
 
+	/* Drop the realmode protection. For the boot CPU the pointer is NULL! */
+	movq	trampoline_lock(%rip), %rax
+	testq	%rax, %rax
+	jz	.Lsetup_idt
+	lock
+	btrl	$0, (%rax)
+
+.Lsetup_idt:
 	/* Setup and Load IDT */
 	pushq	%rsi
 	call	early_setup_idt
@@ -426,6 +497,7 @@  SYM_DATA(initial_vc_handler,	.quad handle_vc_boot_ghcb)
  * reliably detect the end of the stack.
  */
 SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - FRAME_SIZE)
+SYM_DATA(trampoline_lock, .quad 0);
 	__FINITDATA
 
 	__INIT
@@ -660,6 +732,9 @@  SYM_DATA_END(level1_fixmap_pgt)
 SYM_DATA(early_gdt_descr,		.word GDT_ENTRIES*8-1)
 SYM_DATA_LOCAL(early_gdt_descr_base,	.quad INIT_PER_CPU_VAR(gdt_page))
 
+	.align 16
+SYM_DATA(smpboot_control,		.long 0)
+
 	.align 16
 /* This must match the first entry in level2_kernel_pgt */
 SYM_DATA(phys_base, .quad 0x0)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fdcf7c08945f..741da8d306a4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -800,6 +800,16 @@  static int __init cpu_init_udelay(char *str)
 }
 early_param("cpu_init_udelay", cpu_init_udelay);
 
+static bool do_parallel_bringup = true;
+
+static int __init no_parallel_bringup(char *str)
+{
+	do_parallel_bringup = false;
+
+	return 0;
+}
+early_param("no_parallel_bringup", no_parallel_bringup);
+
 static void __init smp_quirk_init_udelay(void)
 {
 	/* if cmdline changed it from default, leave it alone */
@@ -1087,8 +1097,6 @@  int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 #ifdef CONFIG_X86_32
 	/* Stack for startup_32 can be just as for start_secondary onwards */
 	per_cpu(pcpu_hot.top_of_stack, cpu) = task_top_of_stack(idle);
-#else
-	initial_gs = per_cpu_offset(cpu);
 #endif
 	return 0;
 }
@@ -1113,9 +1121,16 @@  static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		start_ip = real_mode_header->trampoline_start64;
 #endif
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
-	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
-	initial_stack  = idle->thread.sp;
+
+	if (IS_ENABLED(CONFIG_X86_32)) {
+		early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
+		initial_stack  = idle->thread.sp;
+	} else if (do_parallel_bringup) {
+		smpboot_control = STARTUP_SECONDARY | STARTUP_PARALLEL;
+	} else {
+		smpboot_control = STARTUP_SECONDARY | apicid;
+	}
 
 	/* Enable the espfix hack for this CPU */
 	init_espfix_ap(cpu);
@@ -1515,6 +1530,17 @@  void __init native_smp_prepare_cpus(unsigned int max_cpus)
 
 	speculative_store_bypass_ht_init();
 
+	/*
+	 * We can do 64-bit AP bringup in parallel if the CPU reports its
+	 * APIC ID in CPUID leaf 0x0B. Otherwise it's too hard. And not
+	 * for SEV-ES guests because they can't use CPUID that early.
+	 * Also, some AMD CPUs crash when doing parallel cpu bringup, disable
+	 * it for all AMD CPUs to be on the safe side.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32) || boot_cpu_data.cpuid_level < 0x0B ||
+	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
+		do_parallel_bringup = false;
+
 	snp_set_wakeup_secondary_cpu();
 }
 
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index af565816d2ba..788e5559549f 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -154,6 +154,9 @@  static void __init setup_real_mode(void)
 
 	trampoline_header->flags = 0;
 
+	trampoline_lock = &trampoline_header->lock;
+	*trampoline_lock = 0;
+
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 
 	/* Map the real mode stub as virtual == physical */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index e38d61d6562e..49ebc1636ffd 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -49,6 +49,19 @@  SYM_CODE_START(trampoline_start)
 	mov	%ax, %es
 	mov	%ax, %ss
 
+	/*
+	 * Make sure only one CPU fiddles with the realmode stack
+	 */
+.Llock_rm:
+	btl	$0, tr_lock
+	jnc	2f
+	pause
+	jmp	.Llock_rm
+2:
+	lock
+	btsl	$0, tr_lock
+	jc	.Llock_rm
+
 	# Setup stack
 	movl	$rm_stack_end, %esp
 
@@ -241,6 +254,7 @@  SYM_DATA_START(trampoline_header)
 	SYM_DATA(tr_efer,		.space 8)
 	SYM_DATA(tr_cr4,		.space 4)
 	SYM_DATA(tr_flags,		.space 4)
+	SYM_DATA(tr_lock,		.space 4)
 SYM_DATA_END(trampoline_header)
 
 #include "trampoline_common.S"
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 2c7396da470c..a18a21dff9bc 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -25,7 +25,7 @@ 
  * For the hotplug case we keep the task structs around and reuse
  * them.
  */
-static DEFINE_PER_CPU(struct task_struct *, idle_threads);
+DEFINE_PER_CPU(struct task_struct *, idle_threads);
 
 struct task_struct *idle_thread_get(unsigned int cpu)
 {