Message ID | 20230221223352.2288528-7-usama.arif@bytedance.com (mailing list archive) |
---|---|
State | Superseded |
Series | Parallel CPU bringup for x86_64 |
On Tue, Feb 21, 2023 at 10:33:50PM +0000, Usama Arif wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Rework the real-mode startup code to allow for APs to be brought up in
> parallel. This is in two parts:
>
>  1. Introduce a bit-spinlock to prevent them from all using the real
>     mode stack at the same time.
>
>  2. Avoid the use of global variables for passing per-CPU information to
>     the APs.
>
> To achieve the latter, export the cpuid_to_apicid[] array so that each
> AP can find its own per_cpu data (and thus initial_gs, initial_stack and
> early_gdt_descr) by searching therein based on its APIC ID.
>
> Introduce a global variable 'smpboot_control' indicating to the AP how
> it should find its APIC ID. For a serialized bringup, the APIC ID is
> explicitly passed in the low bits of smpboot_control, while for parallel
> mode there are flags directing the AP to find its APIC ID in CPUID leaf
> 0x0b (for X2APIC mode) or CPUID leaf 0x01 where 8 bits are sufficient.
>
> Parallel startup may be disabled by a command line option, and also if:
>  • AMD SEV-ES is in use, since the AP may not use CPUID that early.
>  • X2APIC is enabled, but CPUID leaf 0xb is not present and correct.
>  • X2APIC is not enabled but not even CPUID leaf 0x01 exists.
>
> Aside from the fact that APs will now look up their per-cpu data via the
> newly-exported cpuid_to_apicid[] table, there is no behavioural change
> intended yet, since new parallel CPUHP states have not yet been
> added.
>
> [ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ]
> [ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ]
> [ seanc: Fix stray override of initial_gs in common_cpu_up() ]
> [ Oleksandr Natalenko: reported suspend/resume issue fixed in
>   x86_acpi_suspend_lowlevel ]
> Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>
> Tested-by: Kim Phillips <kim.phillips@amd.com>
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> ---
>  .../admin-guide/kernel-parameters.txt         |  3 +
>  arch/x86/include/asm/realmode.h               |  3 +
>  arch/x86/include/asm/smp.h                    | 10 +-
>  arch/x86/kernel/acpi/sleep.c                  |  7 ++
>  arch/x86/kernel/apic/apic.c                   |  2 +-
>  arch/x86/kernel/head_64.S                     | 99 ++++++++++++++++++-
>  arch/x86/kernel/smpboot.c                     | 62 +++++++++++-
>  arch/x86/realmode/init.c                      |  3 +
>  arch/x86/realmode/rm/trampoline_64.S          | 14 +++
>  kernel/smpboot.c                              |  2 +-
>  10 files changed, 197 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 6cfa6e3996cf..ee099b8aac6d 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3819,6 +3819,9 @@
>
>  	nomodule	Disable module load
>
> +	no_parallel_bringup
> +			[X86,SMP] Disable parallel bringup of secondary cores.
> +
>  	nopat		[X86] Disable PAT (page attribute table extension of
>  			pagetables) support.
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index a336feef0af1..f0357cfe2fb0 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -52,6 +52,7 @@ struct trampoline_header {
>  	u64 efer;
>  	u32 cr4;
>  	u32 flags;
> +	u32 lock;
>  #endif
>  };
>
> @@ -65,6 +66,8 @@ extern unsigned long initial_stack;
>  extern unsigned long initial_vc_handler;
>  #endif
>
> +extern u32 *trampoline_lock;
> +
>  extern unsigned char real_mode_blob[];
>  extern unsigned char real_mode_relocs[];
>
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index b4dbb20dab1a..33c0d5fd8af6 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -199,5 +199,13 @@ extern void nmi_selftest(void);
>  #define nmi_selftest() do { } while (0)
>  #endif
>
> -#endif /* __ASSEMBLY__ */
> +extern unsigned int smpboot_control;
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +/* Control bits for startup_64 */
> +#define STARTUP_SECONDARY	0x80000000
> +#define STARTUP_APICID_CPUID_0B	0x40000000
> +#define STARTUP_APICID_CPUID_01	0x20000000
> +
>  #endif /* _ASM_X86_SMP_H */
> diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
> index 3b7f4cdbf2e0..47e75c056cb5 100644
> --- a/arch/x86/kernel/acpi/sleep.c
> +++ b/arch/x86/kernel/acpi/sleep.c
> @@ -16,6 +16,7 @@
>  #include <asm/cacheflush.h>
>  #include <asm/realmode.h>
>  #include <asm/hypervisor.h>
> +#include <asm/smp.h>
>
>  #include <linux/ftrace.h>
>  #include "../../realmode/rm/wakeup.h"
> @@ -57,6 +58,7 @@ asmlinkage acpi_status __visible x86_acpi_enter_sleep_state(u8 state)
>   */
>  int x86_acpi_suspend_lowlevel(void)
>  {
> +	unsigned int __maybe_unused saved_smpboot_ctrl;
>  	struct wakeup_header *header =
>  		(struct wakeup_header *) __va(real_mode_header->wakeup_header);
>
> @@ -115,6 +117,8 @@ int x86_acpi_suspend_lowlevel(void)
>  	early_gdt_descr.address =
>  			(unsigned long)get_cpu_gdt_rw(smp_processor_id());
>  	initial_gs = per_cpu_offset(smp_processor_id());
> +	/* Force the startup into boot mode */
> +	saved_smpboot_ctrl = xchg(&smpboot_control, 0);
>  #endif
>  	initial_code = (unsigned long)wakeup_long64;
>  	saved_magic = 0x123456789abcdef0L;
> @@ -127,6 +131,9 @@ int x86_acpi_suspend_lowlevel(void)
>  	pause_graph_tracing();
>  	do_suspend_lowlevel();
>  	unpause_graph_tracing();
> +
> +	if (IS_ENABLED(CONFIG_64BIT) && IS_ENABLED(CONFIG_SMP))
> +		smpboot_control = saved_smpboot_ctrl;
>  	return 0;
>  }
>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 20d9a604da7c..ac1d7e5da1f2 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2377,7 +2377,7 @@ static int nr_logical_cpuids = 1;
>  /*
>   * Used to store mapping between logical CPU IDs and APIC IDs.
>   */
> -static int cpuid_to_apicid[] = {
> +int cpuid_to_apicid[] = {
>  	[0 ... NR_CPUS - 1] = -1,
>  };
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 222efd4a09bc..0e4e53d231db 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -25,6 +25,7 @@
>  #include <asm/export.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/fixmap.h>
> +#include <asm/smp.h>
>
>  /*
>   * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
> @@ -241,6 +242,83 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  	UNWIND_HINT_EMPTY
>  	ANNOTATE_NOENDBR // above
>
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Is this the boot CPU coming up? If so everything is available
> +	 * in initial_gs, initial_stack and early_gdt_descr.
> +	 */
> +	movl	smpboot_control(%rip), %edx
> +	testl	$STARTUP_SECONDARY, %edx
> +	jz	.Lsetup_cpu
> +
> +	/*
> +	 * Secondary CPUs find out the offsets via the APIC ID. For parallel
> +	 * boot the APIC ID is retrieved from CPUID, otherwise it's encoded
> +	 * in smpboot_control:
> +	 * Bit 31	STARTUP_SECONDARY flag (checked above)
> +	 * Bit 30	STARTUP_APICID_CPUID_0B flag (use CPUID 0x0b)
> +	 * Bit 29	STARTUP_APICID_CPUID_01 flag (use CPUID 0x01)
> +	 * Bit 0-24	APIC ID if STARTUP_APICID_CPUID_xx flags are not set
> +	 */
> +	testl	$STARTUP_APICID_CPUID_0B, %edx
> +	jnz	.Luse_cpuid_0b
> +	testl	$STARTUP_APICID_CPUID_01, %edx
> +	jnz	.Luse_cpuid_01
> +	andl	$0x0FFFFFFF, %edx
> +	jmp	.Lsetup_AP
> +
> +.Luse_cpuid_01:
> +	mov	$0x01, %eax
> +	cpuid
> +	mov	%ebx, %edx
> +	shr	$24, %edx
> +	jmp	.Lsetup_AP
> +
> +.Luse_cpuid_0b:
> +	mov	$0x0B, %eax
> +	xorl	%ecx, %ecx
> +	cpuid
> +
> +.Lsetup_AP:
> +	/* EDX contains the APIC ID of the current CPU */
> +	xorq	%rcx, %rcx
> +	leaq	cpuid_to_apicid(%rip), %rbx
> +
> +.Lfind_cpunr:
> +	cmpl	(%rbx,%rcx,4), %edx
> +	jz	.Linit_cpu_data
> +	inc	%ecx
> +	cmpl	nr_cpu_ids(%rip), %ecx
> +	jb	.Lfind_cpunr
> +
> +	/* APIC ID not found in the table. Drop the trampoline lock and bail. */
> +	movq	trampoline_lock(%rip), %rax
> +	lock
> +	btrl	$0, (%rax)
> +
> +1:	cli
> +	hlt
> +	jmp	1b
> +
> +.Linit_cpu_data:
> +	/* Get the per cpu offset for the given CPU# which is in ECX */
> +	leaq	__per_cpu_offset(%rip), %rbx
> +	movq	(%rbx,%rcx,8), %rbx
> +	/* Save it for GS BASE setup */
> +	movq	%rbx, initial_gs(%rip)
> +
> +	/* Calculate the GDT address */
> +	movq	$gdt_page, %rcx
> +	addq	%rbx, %rcx
> +	movq	%rcx, early_gdt_descr_base(%rip)
> +
> +	/* Find the idle task stack */
> +	movq	idle_threads(%rbx), %rcx
> +	movq	TASK_threadsp(%rcx), %rcx
> +	movq	%rcx, initial_stack(%rip)
> +#endif /* CONFIG_SMP */
> +
> +.Lsetup_cpu:
>  	/*
>  	 * We must switch to a new descriptor in kernel space for the GDT
>  	 * because soon the kernel won't have access anymore to the userspace
> @@ -281,6 +359,14 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  	 */
>  	movq initial_stack(%rip), %rsp
>
> +	/* Drop the realmode protection. For the boot CPU the pointer is NULL! */
> +	movq	trampoline_lock(%rip), %rax
> +	testq	%rax, %rax
> +	jz	.Lsetup_idt
> +	lock
> +	btrl	$0, (%rax)
> +
> +.Lsetup_idt:
>  	/* Setup and Load IDT */
>  	pushq	%rsi
>  	call	early_setup_idt
> @@ -372,7 +458,14 @@ SYM_CODE_END(secondary_startup_64)
>  SYM_CODE_START(start_cpu0)
>  	ANNOTATE_NOENDBR
>  	UNWIND_HINT_EMPTY
> -	movq	initial_stack(%rip), %rsp
> +	/* Load the per-cpu base for CPU#0 */
> +	leaq	__per_cpu_offset(%rip), %rbx
> +	movq	(%rbx), %rbx
> +
> +	/* Find the idle task stack */
> +	movq	idle_threads(%rbx), %rcx
> +	movq	TASK_threadsp(%rcx), %rsp
> +
>  	jmp	.Ljump_to_C_code
>  SYM_CODE_END(start_cpu0)
>  #endif
> @@ -426,6 +519,7 @@ SYM_DATA(initial_vc_handler, .quad handle_vc_boot_ghcb)
>   * reliably detect the end of the stack.
>   */
>  SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - FRAME_SIZE)
> +SYM_DATA(trampoline_lock, .quad 0);
>  	__FINITDATA
>
>  	__INIT
> @@ -660,6 +754,9 @@ SYM_DATA_END(level1_fixmap_pgt)
>  SYM_DATA(early_gdt_descr, .word GDT_ENTRIES*8-1)
>  SYM_DATA_LOCAL(early_gdt_descr_base, .quad INIT_PER_CPU_VAR(gdt_page))
>
> +	.align 16
> +SYM_DATA(smpboot_control, .long 0)
> +
>  	.align 16
>  /* This must match the first entry in level2_kernel_pgt */
>  SYM_DATA(phys_base, .quad 0x0)
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b18c1385e181..6924d91b69ca 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -797,6 +797,16 @@ static int __init cpu_init_udelay(char *str)
>  }
>  early_param("cpu_init_udelay", cpu_init_udelay);
>
> +static bool do_parallel_bringup __ro_after_init = true;
> +
> +static int __init no_parallel_bringup(char *str)
> +{
> +	do_parallel_bringup = false;
> +
> +	return 0;
> +}
> +early_param("no_parallel_bringup", no_parallel_bringup);
> +
>  static void __init smp_quirk_init_udelay(void)
>  {
>  	/* if cmdline changed it from default, leave it alone */
> @@ -1084,8 +1094,6 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
>  #ifdef CONFIG_X86_32
>  	/* Stack for startup_32 can be just as for start_secondary onwards */
>  	per_cpu(pcpu_hot.top_of_stack, cpu) = task_top_of_stack(idle);
> -#else
> -	initial_gs = per_cpu_offset(cpu);
>  #endif
>  	return 0;
>  }
> @@ -1110,9 +1118,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>  	start_ip = real_mode_header->trampoline_start64;
>  #endif
>  	idle->thread.sp = (unsigned long)task_pt_regs(idle);
> -	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
>  	initial_code = (unsigned long)start_secondary;
> -	initial_stack = idle->thread.sp;
> +
> +	if (IS_ENABLED(CONFIG_X86_32)) {
> +		early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> +		initial_stack = idle->thread.sp;
> +	} else if (!do_parallel_bringup) {
> +		smpboot_control = STARTUP_SECONDARY | apicid;
> +	}
>
>  	/* Enable the espfix hack for this CPU */
>  	init_espfix_ap(cpu);
> @@ -1512,6 +1525,47 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>
>  	speculative_store_bypass_ht_init();
>
> +	/*
> +	 * We can do 64-bit AP bringup in parallel if the CPU reports
> +	 * its APIC ID in CPUID (either leaf 0x0B if we need the full
> +	 * APIC ID in X2APIC mode, or leaf 0x01 if 8 bits are
> +	 * sufficient). Otherwise it's too hard. And not for SEV-ES
> +	 * guests because they can't use CPUID that early.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32) || boot_cpu_data.cpuid_level < 1 ||
> +	    (x2apic_mode && boot_cpu_data.cpuid_level < 0xb) ||
> +	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
> +		do_parallel_bringup = false;
> +
> +	if (do_parallel_bringup && x2apic_mode) {
> +		unsigned int eax, ebx, ecx, edx;
> +
> +		/*
> +		 * To support parallel bringup in x2apic mode, the AP will need
> +		 * to obtain its APIC ID from CPUID 0x0B, since CPUID 0x01 has
> +		 * only 8 bits. Check that it is present and seems correct.
> +		 */
> +		cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx);
> +
> +		/*
> +		 * AMD says that if executed with an unimplemented level in
> +		 * ECX, then it will return all zeroes in EAX. Intel says it
> +		 * will return zeroes in both EAX and EBX. Checking only EAX
> +		 * should be sufficient.
> +		 */
> +		if (eax) {
> +			pr_debug("Using CPUID 0xb for parallel CPU startup\n");
> +			smpboot_control = STARTUP_SECONDARY | STARTUP_APICID_CPUID_0B;
> +		} else {
> +			pr_info("Disabling parallel bringup because CPUID 0xb looks untrustworthy\n");
> +			do_parallel_bringup = false;
> +		}
> +	} else if (do_parallel_bringup) {
> +		/* Without X2APIC, what's in CPUID 0x01 should suffice. */
> +		pr_debug("Using CPUID 0x1 for parallel CPU startup\n");
> +		smpboot_control = STARTUP_SECONDARY | STARTUP_APICID_CPUID_01;
> +	}
> +
>  	snp_set_wakeup_secondary_cpu();
>  }
>
> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> index af565816d2ba..788e5559549f 100644
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -154,6 +154,9 @@ static void __init setup_real_mode(void)
>
>  	trampoline_header->flags = 0;
>
> +	trampoline_lock = &trampoline_header->lock;
> +	*trampoline_lock = 0;
> +
>  	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
>
>  	/* Map the real mode stub as virtual == physical */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index e38d61d6562e..49ebc1636ffd 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -49,6 +49,19 @@ SYM_CODE_START(trampoline_start)
>  	mov	%ax, %es
>  	mov	%ax, %ss
>
> +	/*
> +	 * Make sure only one CPU fiddles with the realmode stack
> +	 */
> +.Llock_rm:
> +	btl	$0, tr_lock
> +	jnc	2f
> +	pause
> +	jmp	.Llock_rm
> +2:
> +	lock
> +	btsl	$0, tr_lock
> +	jc	.Llock_rm
> +

It looks like these changes should also be applied to trampoline_start64(),
which is used to boot APs when apic->wakeup_secondary_cpu_64 is available,
e.g. when ACPI_MADT_TYPE_MULTIPROC_WAKEUP is available.

One case I know of is the Intel TD guest, which currently uses the MADT
wakeup to wake APs.

>  	# Setup stack
>  	movl	$rm_stack_end, %esp
>
> @@ -241,6 +254,7 @@ SYM_DATA_START(trampoline_header)
>  	SYM_DATA(tr_efer, .space 8)
>  	SYM_DATA(tr_cr4, .space 4)
>  	SYM_DATA(tr_flags, .space 4)
> +	SYM_DATA(tr_lock, .space 4)
>  SYM_DATA_END(trampoline_header)
>
>  #include "trampoline_common.S"
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 2c7396da470c..a18a21dff9bc 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -25,7 +25,7 @@
>   * For the hotplug case we keep the task structs around and reuse
>   * them.
>   */
> -static DEFINE_PER_CPU(struct task_struct *, idle_threads);
> +DEFINE_PER_CPU(struct task_struct *, idle_threads);
>
>  struct task_struct *idle_thread_get(unsigned int cpu)
>  {
> --
> 2.25.1
>
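For readers following the thread, the smpboot_control protocol from the patch above can be modelled in plain user-space C. This is only a rough sketch under stated assumptions, not kernel code: the three STARTUP_* values are copied from the asm/smp.h hunk, but NR_CPUS_DEMO, cpuid_to_apicid_demo, struct fake_cpuid, parallel_control(), ap_apic_id() and apicid_to_cpu() are hypothetical names invented for illustration, and the CONFIG_X86_32 check and the no_parallel_bringup option are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the patch's asm/smp.h hunk. */
#define STARTUP_SECONDARY        0x80000000u
#define STARTUP_APICID_CPUID_0B  0x40000000u
#define STARTUP_APICID_CPUID_01  0x20000000u

/* Hypothetical stand-in for cpuid_to_apicid[] and nr_cpu_ids. */
#define NR_CPUS_DEMO 4
static const int cpuid_to_apicid_demo[NR_CPUS_DEMO] = { 0, 2, 4, 6 };

/* Hypothetical: the CPUID register values an AP would read about itself. */
struct fake_cpuid {
	uint32_t leaf01_ebx;	/* initial APIC ID lives in bits 31:24 */
	uint32_t leaf0b_edx;	/* full x2APIC ID */
};

/*
 * Boot-CPU side, loosely following native_smp_prepare_cpus(): return the
 * control word for parallel mode, or 0 when bringup has to stay serialized.
 */
static uint32_t parallel_control(bool x2apic, uint32_t cpuid_level,
				 bool sev_es, uint32_t leaf0b_eax)
{
	if (cpuid_level < 1 || sev_es)
		return 0;
	if (!x2apic)
		return STARTUP_SECONDARY | STARTUP_APICID_CPUID_01;
	if (cpuid_level < 0xb || leaf0b_eax == 0)
		return 0;	/* leaf 0xb missing or looks untrustworthy */
	return STARTUP_SECONDARY | STARTUP_APICID_CPUID_0B;
}

/* AP side: the same branch ladder as the new code in head_64.S. */
static uint32_t ap_apic_id(uint32_t control, const struct fake_cpuid *c)
{
	assert(control & STARTUP_SECONDARY);	/* boot CPU skips all of this */

	if (control & STARTUP_APICID_CPUID_0B)
		return c->leaf0b_edx;
	if (control & STARTUP_APICID_CPUID_01)
		return c->leaf01_ebx >> 24;
	return control & 0x0FFFFFFFu;		/* same mask as the patch's andl */
}

/* Linear search as in .Lfind_cpunr; -1 models the cli/hlt bail-out path. */
static int apicid_to_cpu(uint32_t apicid)
{
	for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		if ((uint32_t)cpuid_to_apicid_demo[cpu] == apicid)
			return cpu;
	return -1;
}
```

In the serialized case the boot CPU writes STARTUP_SECONDARY | apicid before each wakeup, so ap_apic_id() falls through to the mask; in parallel mode every AP sees the same control word and asks CPUID about itself instead.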
On Fri, 2023-02-24 at 13:59 +0800, Yuan Yao wrote:
>
> > +	 * Make sure only one CPU fiddles with the realmode stack
> > +	 */
> > +.Llock_rm:
> > +	btl	$0, tr_lock
> > +	jnc	2f
> > +	pause
> > +	jmp	.Llock_rm
> > +2:
> > +	lock
> > +	btsl	$0, tr_lock
> > +	jc	.Llock_rm
> > +
>
> Looks these changes should be applied to trampoline_start64()
> yet, which is used for boot up APs when apic->wakeup_secondary_cpu_64
> is available, e.g when ACPI_MADT_TYPE_MULTIPROC_WAKEUP is available.
>
> One case I know is the INTEL TD guest, which using the MADT wakeup
> for AP wake up now.

Yeah.

I think we could probably pull that all out into a separate asm
"load_realmode_esp" function which takes the lock and actually does the
'movl $rm_stack_end, %esp'.

Then we call it from all the places which currently use $rm_stack_end,
including sev_es_trampoline_start (even though we currently disable
parallel startup there because CPUID doesn't work that early).

Oh... except of course it can't be a function because we haven't got a
stack, have we? It's a macro.

Usama, are you happy using .macro in gas?
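As an aside, the lock under discussion is a test-and-test-and-set bit spinlock on bit 0 of a 32-bit word. A minimal C11 sketch of the same idea follows, assuming <stdatomic.h> in place of the hand-written btl / lock btsl / lock btrl sequences; tr_lock_try(), tr_lock_acquire() and tr_lock_release() are hypothetical names used only for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One attempt: returns true if this caller set bit 0 and now owns the lock. */
static bool tr_lock_try(_Atomic uint32_t *lock)
{
	/* Cheap plain read first (the btl / jnc pair in the trampoline). */
	if (atomic_load_explicit(lock, memory_order_relaxed) & 1u)
		return false;

	/* Atomic test-and-set (lock btsl); an old value of 1 means we lost. */
	return !(atomic_fetch_or_explicit(lock, 1u, memory_order_acquire) & 1u);
}

/* Spin until the bit is ours; the real-mode code executes 'pause' here. */
static void tr_lock_acquire(_Atomic uint32_t *lock)
{
	assert(lock != NULL);
	while (!tr_lock_try(lock))
		;
}

/*
 * Release as done later in head_64.S (lock btrl). A NULL pointer models
 * the boot CPU, which never took the lock and must skip the release.
 */
static void tr_lock_release(_Atomic uint32_t *lock)
{
	if (lock)
		atomic_fetch_and_explicit(lock, ~1u, memory_order_release);
}
```

Reading the bit before attempting the atomic set keeps waiting CPUs from hammering the cache line with locked operations, which is why the assembly checks with a plain btl first. Note that acquire and release happen in different places in the patch: the lock is taken in the real-mode trampoline and only dropped once the AP has switched to its own stack.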
On 24/02/2023 15:46, David Woodhouse wrote:
> On Fri, 2023-02-24 at 13:59 +0800, Yuan Yao wrote:
>>
>>> +	 * Make sure only one CPU fiddles with the realmode stack
>>> +	 */
>>> +.Llock_rm:
>>> +	btl	$0, tr_lock
>>> +	jnc	2f
>>> +	pause
>>> +	jmp	.Llock_rm
>>> +2:
>>> +	lock
>>> +	btsl	$0, tr_lock
>>> +	jc	.Llock_rm
>>> +
>>
>> Looks these changes should be applied to trampoline_start64()
>> yet, which is used for boot up APs when apic->wakeup_secondary_cpu_64
>> is available, e.g when ACPI_MADT_TYPE_MULTIPROC_WAKEUP is available.
>>
>> One case I know is the INTEL TD guest, which using the MADT wakeup
>> for AP wake up now.
>
> Yeah.
>
> I think we could probably pull that all out into a separate asm
> "load_realmode_esp" function which takes the lock and actually does the
> 'movl $rm_stack_end, %esp'.
>
> Then we call it from all the places which currently use $rm_stack_end,
> including sev_es_trampoline_start (even though we currently disable
> parallel startup there because CPUID doesn't work that early).
>
> Oh... except of course it can't be a function because we haven't got a
> stack, have we? It's a macro.
>
> Usama, are you happy using .macro in gas?

Yes, makes sense. I guess something like below should be ok? It makes
sense to add it for sev as well.

diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index e38d61d6562e..3a724d8d85b9 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -38,6 +38,25 @@
 	.code16

 	.balign	PAGE_SIZE
+
+.macro LOAD_REALMODE_ESP
+	/*
+	 * Make sure only one CPU fiddles with the realmode stack
+	 */
+.Llock_rm\@:
+	btl	$0, tr_lock
+	jnc	2f
+	pause
+	jmp	.Llock_rm\@
+2:
+	lock
+	btsl	$0, tr_lock
+	jc	.Llock_rm\@
+
+	# Setup stack
+	movl	$rm_stack_end, %esp
+.endm
+
 SYM_CODE_START(trampoline_start)
 	cli			# We should be safe anyway
 	wbinvd
@@ -49,8 +68,7 @@ SYM_CODE_START(trampoline_start)
 	mov	%ax, %es
 	mov	%ax, %ss

-	# Setup stack
-	movl	$rm_stack_end, %esp
+	LOAD_REALMODE_ESP

 	call	verify_cpu		# Verify the cpu supports long mode
 	testl	%eax, %eax		# Check for return code
@@ -93,8 +111,7 @@ SYM_CODE_START(sev_es_trampoline_start)
 	mov	%ax, %es
 	mov	%ax, %ss

-	# Setup stack
-	movl	$rm_stack_end, %esp
+	LOAD_REALMODE_ESP

 	jmp	.Lswitch_to_protected
 SYM_CODE_END(sev_es_trampoline_start)
@@ -177,7 +194,7 @@ SYM_CODE_START(pa_trampoline_compat)
 	 * In compatibility mode. Prep ESP and DX for startup_32, then disable
 	 * paging and complete the switch to legacy 32-bit mode.
 	 */
-	movl	$rm_stack_end, %esp
+	LOAD_REALMODE_ESP
 	movw	$__KERNEL_DS, %dx

 	movl	$(CR0_STATE & ~X86_CR0_PG), %eax