Message ID | 20230508185217.671595388@linutronix.de
---|---
State | Awaiting Upstream
Series | cpu/hotplug, x86: Reworked parallel CPU bringup
On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> @@ -233,14 +237,31 @@ static void notrace start_secondary(void
> 	load_cr3(swapper_pg_dir);
> 	__flush_tlb_all();
>  #endif
> +	/*
> +	 * Sync point with wait_cpu_initialized(). Before proceeding through
> +	 * cpu_init(), the AP will call wait_for_master_cpu() which sets its
> +	 * own bit in cpu_initialized_mask and then waits for the BSP to set
> +	 * its bit in cpu_callout_mask to release it.
> +	 */
> 	cpu_init_secondary();
> 	rcu_cpu_starting(raw_smp_processor_id());
> 	x86_cpuinit.early_percpu_clock_init();
> +
> +	/*
> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
> +	 * but just sets the bit to let the controlling CPU (BSP) know that
> +	 * it's got this far.
> +	 */
> 	smp_callin();
>
> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
> 	barrier();

Not to the detriment of this patch, but this barrier() and its comment
seem weird vs smp_callin(). That function ends with an atomic bitop (it
has to, at the very least it must not be weaker than store-release) but
also has an explicit wmb() to order setup vs CPU_STARTING.

(arguably that should be a full fence *AND* get a comment)

There is no way the smp_processor_id() referred to in this comment can
land before cpu_init() even without the barrier().

> -	/* Check TSC synchronization with the control CPU: */
> +
> +	/*
> +	 * Check TSC synchronization with the control CPU, which will do
> +	 * its part of this from wait_cpu_online(), making it an implicit
> +	 * synchronization point.
> +	 */
> 	check_tsc_sync_target();
>
> 	/*
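For context, the ordering Peter is describing -- the explicit wmb() ahead of
the CPU_STARTING notifiers and the final atomic bit set that releases the
control CPU -- sits at the tail of smp_callin(). A condensed sketch of that
tail, assuming the function ends with cpumask_set_cpu() as stated above
(most of the real function is elided, this is not the verbatim upstream code):

	static void smp_callin(void)
	{
		int cpuid = smp_processor_id();

		/* ... APIC setup, calibration, topology setup elided ... */

		/* Order the setup above against the CPU_STARTING notifiers. */
		wmb();

		notify_cpu_starting(cpuid);

		/*
		 * Let the control CPU proceed. On x86 this is a LOCK-prefixed
		 * RMW, i.e. a full memory barrier and a compiler barrier,
		 * which is why the barrier() after smp_callin() adds nothing.
		 */
		cpumask_set_cpu(cpuid, cpu_callin_mask);
	}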
Again, not really this patch, but since I had to look at this code ....

On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c

	/*
	 * AP might wait on cpu_callout_mask in cpu_init() with
	 * cpu_initialized_mask set if previous attempt to online
	 * it timed-out. Clear cpu_initialized_mask so that after
	 * INIT/SIPI it could start with a clean state.
	 */
	cpumask_clear_cpu(cpu, cpu_initialized_mask);
	smp_mb();

^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
implies much the same (this is x86 after all). If you want to be super
explicit about it write:

	smp_mb__after_atomic();

(which is a no-op) but then it still very much requires a comment as to
what exactly it orders against what.

	/*
	 * Wake up a CPU in difference cases:
	 * - Use a method from the APIC driver if one defined, with wakeup
	 *   straight to 64-bit mode preferred over wakeup to RM.
	 * Otherwise,
>	 * - Use an INIT boot APIC message
>	 */
>	if (apic->wakeup_secondary_cpu_64)
> +		return apic->wakeup_secondary_cpu_64(apicid, start_ip);
>	else if (apic->wakeup_secondary_cpu)
> +		return apic->wakeup_secondary_cpu(apicid, start_ip);
>
> +	return wakeup_secondary_cpu_via_init(apicid, start_ip);
> +}
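Spelled out, the change Peter suggests would look roughly like this (a sketch
only; the barrier comment text is illustrative, and as Thomas notes further
down, the mask and the barrier disappear later in the series anyway):

	/*
	 * AP might wait on cpu_callout_mask in cpu_init() with
	 * cpu_initialized_mask set if a previous attempt to online it
	 * timed out. Clear cpu_initialized_mask so that after INIT/SIPI
	 * the AP starts from a clean state.
	 */
	cpumask_clear_cpu(cpu, cpu_initialized_mask);

	/*
	 * cpumask_clear_cpu() is a locked RMW on x86, so this is a no-op
	 * there; it merely documents that the clear must be visible
	 * before the INIT/SIPI sequence below wakes the AP.
	 */
	smp_mb__after_atomic();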
And since I'm commenting on existing things anyway, let me continue...

On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> +static int wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
> +{
> +	unsigned long timeout;
>
> +	/*
> +	 * Wait up to 10s for the CPU to report in.
> +	 */
> +	timeout = jiffies + 10*HZ;
> +	while (time_before(jiffies, timeout)) {
> +		if (cpumask_test_cpu(cpu, mask))
> +			return 0;
> +
> +		schedule();
> 	}
> +	return -1;
> +}
> +/*
> + * Bringup step three: Wait for the target AP to reach smp_callin().
> + * The AP is not waiting for us here so we don't need to parallelise
> + * this step. Not entirely clear why we care about this, since we just
> + * proceed directly to TSC synchronization which is the next sync
> + * point with the AP anyway.
> + */
> +static void wait_cpu_callin(unsigned int cpu)
> +{
> +	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
> +		schedule();
> +}
> +
> +/*
> + * Bringup step four: Synchronize the TSC and wait for the target AP
> + * to reach set_cpu_online() in start_secondary().
> + */
> +static void wait_cpu_online(unsigned int cpu)
> {
> 	unsigned long flags;
> +
> +	/*
> +	 * Check TSC synchronization with the AP (keep irqs disabled
> +	 * while doing so):
> +	 */
> +	local_irq_save(flags);
> +	check_tsc_sync_source(cpu);
> +	local_irq_restore(flags);
> +
> +	/*
> +	 * Wait for the AP to mark itself online, so the core caller
> +	 * can drop sparse_irq_lock.
> +	 */
> +	while (!cpu_online(cpu))
> +		schedule();
> +}

These schedule() loops make me itch... this is basically Ye Olde yield()
loop with all its known 'benefits'.

Now, I don't think it's horribly broken, we're explicitly waiting on
another CPU and can't have priority inversions, but yuck!

It could all be somewhat cleaned up with wait_var_event{_timeout}() and
wake_up_var(), but I'm really not sure that's worth it. But at least it
requires a comment to justify.
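For illustration, the wait_var_event()/wake_up_var() shape Peter mentions
would look something like the sketch below. This is not part of the posted
series; the rework of wait_cpu_cpumask() is hypothetical, and the report_in()
helper is made up purely to show the waker side.

	/* Control-CPU side: replace the open-coded schedule() polling loop. */
	static int wait_cpu_cpumask(unsigned int cpu, struct cpumask *mask)
	{
		/* wait_var_event_timeout() returns 0 if the 10s budget expires. */
		if (!wait_var_event_timeout(mask, cpumask_test_cpu(cpu, mask), 10 * HZ))
			return -1;
		return 0;
	}

	/* AP side: pair every mask update with a wakeup keyed on the mask. */
	static void report_in(unsigned int cpu, struct cpumask *mask)
	{
		cpumask_set_cpu(cpu, mask);
		wake_up_var(mask);
	}

The catch, and presumably why Peter hedges on whether it is worth it, is that
the AP side runs in early bringup context, so every mask update would need a
matching wakeup added without upsetting a CPU that is not yet online.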
On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> +	/*
>> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
>> +	 * but just sets the bit to let the controlling CPU (BSP) know that
>> +	 * it's got this far.
>> +	 */
>> 	smp_callin();
>>
>> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
>> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
>> 	barrier();
>
> Not to the detriment of this patch, but this barrier() and its comment
> seem weird vs smp_callin(). That function ends with an atomic bitop (it
> has to, at the very least it must not be weaker than store-release) but
> also has an explicit wmb() to order setup vs CPU_STARTING.
>
> (arguably that should be a full fence *AND* get a comment)
>
> There is no way the smp_processor_id() referred to in this comment can
> land before cpu_init() even without the barrier().

Right. Let me clean that up.

Thanks,

        tglx
On Tue, May 09 2023 at 12:19, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c
>
> 	/*
> 	 * AP might wait on cpu_callout_mask in cpu_init() with
> 	 * cpu_initialized_mask set if previous attempt to online
> 	 * it timed-out. Clear cpu_initialized_mask so that after
> 	 * INIT/SIPI it could start with a clean state.
> 	 */
> 	cpumask_clear_cpu(cpu, cpu_initialized_mask);
> 	smp_mb();
>
> ^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
> implies much the same (this is x86 after all). If you want to be super
> explicit about it write:
>
> 	smp_mb__after_atomic();
>
> (which is a no-op) but then it still very much requires a comment as to
> what exactly it orders against what.

As this is gone a few patches later, I'll just be lazy and leave it alone.
On Tue, May 09 2023 at 12:31, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> +	/*
>> +	 * Wait for the AP to mark itself online, so the core caller
>> +	 * can drop sparse_irq_lock.
>> +	 */
>> +	while (!cpu_online(cpu))
>> +		schedule();
>> +}
>
> These schedule() loops make me itch... this is basically Ye Olde yield()
> loop with all its known 'benefits'.
>
> Now, I don't think it's horribly broken, we're explicitly waiting on
> another CPU and can't have priority inversions, but yuck!
>
> It could all be somewhat cleaned up with wait_var_event{_timeout}() and
> wake_up_var(), but I'm really not sure that's worth it. But at least it
> requires a comment to justify.

All of them are gone with the later patches and the control CPU will
just go straight to wait for the completion in the core code.
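The completion-based end state Thomas refers to would look roughly like this
on the waiting side (names here are illustrative only; the later patches
actually wait on the hotplug core's existing per-CPU state machine rather
than a dedicated completion):

	static DECLARE_COMPLETION(ap_alive);

	/* Control CPU: block until the AP reports in, with a 10s budget. */
	static int wait_for_ap(unsigned int cpu)
	{
		if (!wait_for_completion_timeout(&ap_alive, 10 * HZ))
			return -EIO;
		return 0;
	}

	/* AP: signal the waiter once it has marked itself online. */
	static void ap_report_alive(void)
	{
		complete(&ap_alive);
	}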
On Tue, May 09 2023 at 14:07, Thomas Gleixner wrote:
> On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
>> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>>> +	/*
>>> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
>>> +	 * but just sets the bit to let the controlling CPU (BSP) know that
>>> +	 * it's got this far.
>>> +	 */
>>> 	smp_callin();
>>>
>>> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
>>> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
>>> 	barrier();
>>
>> Not to the detriment of this patch, but this barrier() and its comment
>> seem weird vs smp_callin(). That function ends with an atomic bitop (it
>> has to, at the very least it must not be weaker than store-release) but
>> also has an explicit wmb() to order setup vs CPU_STARTING.
>>
>> (arguably that should be a full fence *AND* get a comment)
>>
>> There is no way the smp_processor_id() referred to in this comment can
>> land before cpu_init() even without the barrier().
>
> Right. Let me clean that up.

So I went and tried to figure out where this comes from. It's from
d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those
well documented combo patches which change the world and some more.

The context back then was:

	/*
	 * Dont put anything before smp_callin(), SMP
	 * booting is too fragile that we want to limit the
	 * things done here to the most necessary things.
	 */
	cpu_init();
	smp_callin();

	Dprintk("cpu %d: waiting for commence\n", smp_processor_id());

That still does not explain what the barrier is doing.

I tried to harvest the mailing list archives, but did not find anything.
The list used back then, discuss@x86-64.org, was never publicly
archived... Boris gave me a tarball, but this 'barrier()' add-on was
nowhere discussed in public.

As the barrier has no obvious value, I'm adding a patch upfront which
removes it.

Thanks,

        tglx
On Tue, May 09 2023 at 12:19, Peter Zijlstra wrote:
> Again, not really this patch, but since I had to look at this code ....
>
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c
>
> 	/*
> 	 * AP might wait on cpu_callout_mask in cpu_init() with
> 	 * cpu_initialized_mask set if previous attempt to online
> 	 * it timed-out. Clear cpu_initialized_mask so that after
> 	 * INIT/SIPI it could start with a clean state.
> 	 */
> 	cpumask_clear_cpu(cpu, cpu_initialized_mask);
> 	smp_mb();
>
> ^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
> implies much the same (this is x86 after all). If you want to be super
> explicit about it write:
>
> 	smp_mb__after_atomic();
>
> (which is a no-op) but then it still very much requires a comment as to
> what exactly it orders against what.

Won't bother either as that mask is gone a few patches later.
On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> Not to the detriment of this patch, but this barrier() and its comment
> seem weird vs smp_callin(). That function ends with an atomic bitop (it
> has to, at the very least it must not be weaker than store-release) but
> also has an explicit wmb() to order setup vs CPU_STARTING.
>
> (arguably that should be a full fence *AND* get a comment)

TBH: I'm grasping for something 'arguable': What's the point of this
wmb() or even a mb()?

Most of the [w]mb()'s in smpboot.c except those in mwait_play_dead()
have a very distinct voodoo programming smell.

Thanks,

        tglx
On Tue, May 09, 2023 at 10:11:05PM +0200, Thomas Gleixner wrote:
> On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> > On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> > Not to the detriment of this patch, but this barrier() and its comment
> > seem weird vs smp_callin(). That function ends with an atomic bitop (it
> > has to, at the very least it must not be weaker than store-release) but
> > also has an explicit wmb() to order setup vs CPU_STARTING.
> >
> > (arguably that should be a full fence *AND* get a comment)
>
> TBH: I'm grasping for something 'arguable': What's the point of this
> wmb() or even a mb()?
>
> Most of the [w]mb()'s in smpboot.c except those in mwait_play_dead()
> have a very distinct voodoo programming smell.

Oh fully agreed, esp. without a comment these things are hugely suspect.
I could not immediately see a purpose either.

My 'arguably' argument was that IF it was needed at all, it would make
more sense to me to also constrain loads.

But I'd be more than happy to see the whole thing go. But perhaps not in
this series?
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -193,6 +193,10 @@ static void smp_callin(void)
 
 	wmb();
 
+	/*
+	 * This runs the AP through all the cpuhp states to its target
+	 * state (CPUHP_ONLINE in the case of serial bringup).
+	 */
 	notify_cpu_starting(cpuid);
 
 	/*
@@ -233,14 +237,31 @@ static void notrace start_secondary(void
 	load_cr3(swapper_pg_dir);
 	__flush_tlb_all();
 #endif
+	/*
+	 * Sync point with wait_cpu_initialized(). Before proceeding through
+	 * cpu_init(), the AP will call wait_for_master_cpu() which sets its
+	 * own bit in cpu_initialized_mask and then waits for the BSP to set
+	 * its bit in cpu_callout_mask to release it.
+	 */
 	cpu_init_secondary();
 	rcu_cpu_starting(raw_smp_processor_id());
 	x86_cpuinit.early_percpu_clock_init();
+
+	/*
+	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
+	 * but just sets the bit to let the controlling CPU (BSP) know that
+	 * it's got this far.
+	 */
 	smp_callin();
 
-	/* otherwise gcc will move up smp_processor_id before the cpu_init */
+	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
 	barrier();
-	/* Check TSC synchronization with the control CPU: */
+
+	/*
+	 * Check TSC synchronization with the control CPU, which will do
+	 * its part of this from wait_cpu_online(), making it an implicit
+	 * synchronization point.
+	 */
 	check_tsc_sync_target();
 
 	/*
@@ -259,6 +280,7 @@ static void notrace start_secondary(void
 	 * half valid vector space.
 	 */
 	lock_vector_lock();
+	/* Sync point with do_wait_cpu_online() */
 	set_cpu_online(smp_processor_id(), true);
 	lapic_online();
 	unlock_vector_lock();
@@ -981,17 +1003,13 @@ int common_cpu_up(unsigned int cpu, stru
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
  * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
- * Returns zero if CPU booted OK, else error code from
+ * Returns zero if startup was successfully sent, else error code from
  * ->wakeup_secondary_cpu.
  */
 static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 {
-	/* start_ip had better be page-aligned! */
 	unsigned long start_ip = real_mode_header->trampoline_start;
-	unsigned long boot_error = 0;
-	unsigned long timeout;
-
 #ifdef CONFIG_X86_64
 	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
 	if (apic->wakeup_secondary_cpu_64)
@@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c
 	 * - Use an INIT boot APIC message
 	 */
 	if (apic->wakeup_secondary_cpu_64)
-		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+		return apic->wakeup_secondary_cpu_64(apicid, start_ip);
 	else if (apic->wakeup_secondary_cpu)
-		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
-	else
-		boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
+		return apic->wakeup_secondary_cpu(apicid, start_ip);
 
-	if (!boot_error) {
-		/*
-		 * Wait 10s total for first sign of life from AP
-		 */
-		boot_error = -1;
-		timeout = jiffies + 10*HZ;
-		while (time_before(jiffies, timeout)) {
-			if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
-				/*
-				 * Tell AP to proceed with initialization
-				 */
-				cpumask_set_cpu(cpu, cpu_callout_mask);
-				boot_error = 0;
-				break;
-			}
-			schedule();
-		}
-	}
+	return wakeup_secondary_cpu_via_init(apicid, start_ip);
+}
 
-	if (!boot_error) {
-		/*
-		 * Wait till AP completes initial initialization
-		 */
-		while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
-			/*
-			 * Allow other tasks to run while we wait for the
-			 * AP to come online. This also gives a chance
-			 * for the MTRR work(triggered by the AP coming online)
-			 * to be completed in the stop machine context.
-			 */
-			schedule();
-		}
-	}
+static int wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
+{
+	unsigned long timeout;
 
-	if (x86_platform.legacy.warm_reset) {
-		/*
-		 * Cleanup possible dangling ends...
-		 */
-		smpboot_restore_warm_reset_vector();
+	/*
+	 * Wait up to 10s for the CPU to report in.
+	 */
+	timeout = jiffies + 10*HZ;
+	while (time_before(jiffies, timeout)) {
+		if (cpumask_test_cpu(cpu, mask))
+			return 0;
+
+		schedule();
 	}
+	return -1;
+}
 
-	return boot_error;
+/*
+ * Bringup step two: Wait for the target AP to reach cpu_init_secondary()
+ * and thus wait_for_master_cpu(), then set cpu_callout_mask to allow it
+ * to proceed. The AP will then proceed past setting its 'callin' bit
+ * and end up waiting in check_tsc_sync_target() until we reach
+ * do_wait_cpu_online() to tend to it.
+ */
+static int wait_cpu_initialized(unsigned int cpu)
+{
+	/*
+	 * Wait for first sign of life from AP.
+	 */
+	if (wait_cpu_cpumask(cpu, cpu_initialized_mask))
+		return -1;
+
+	cpumask_set_cpu(cpu, cpu_callout_mask);
+	return 0;
 }
 
-int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+/*
+ * Bringup step three: Wait for the target AP to reach smp_callin().
+ * The AP is not waiting for us here so we don't need to parallelise
+ * this step. Not entirely clear why we care about this, since we just
+ * proceed directly to TSC synchronization which is the next sync
+ * point with the AP anyway.
+ */
+static void wait_cpu_callin(unsigned int cpu)
+{
+	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
+		schedule();
+}
+
+/*
+ * Bringup step four: Synchronize the TSC and wait for the target AP
+ * to reach set_cpu_online() in start_secondary().
+ */
+static void wait_cpu_online(unsigned int cpu)
 {
-	int apicid = apic->cpu_present_to_apicid(cpu);
 	unsigned long flags;
+
+	/*
+	 * Check TSC synchronization with the AP (keep irqs disabled
+	 * while doing so):
+	 */
+	local_irq_save(flags);
+	check_tsc_sync_source(cpu);
+	local_irq_restore(flags);
+
+	/*
+	 * Wait for the AP to mark itself online, so the core caller
+	 * can drop sparse_irq_lock.
+	 */
+	while (!cpu_online(cpu))
+		schedule();
+}
+
+static int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
+{
+	int apicid = apic->cpu_present_to_apicid(cpu);
 	int err;
 
 	lockdep_assert_irqs_enabled();
@@ -1142,25 +1189,33 @@ int native_cpu_up(unsigned int cpu, stru
 		return err;
 
 	err = do_boot_cpu(apicid, cpu, tidle);
-	if (err) {
+	if (err)
 		pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
-		return err;
-	}
 
-	/*
-	 * Check TSC synchronization with the AP (keep irqs disabled
-	 * while doing so):
-	 */
-	local_irq_save(flags);
-	check_tsc_sync_source(cpu);
-	local_irq_restore(flags);
+	return err;
+}
 
-	while (!cpu_online(cpu)) {
-		cpu_relax();
-		touch_nmi_watchdog();
-	}
+int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+{
+	int ret;
 
-	return 0;
+	ret = native_kick_ap(cpu, tidle);
+	if (ret)
+		goto out;
+
+	ret = wait_cpu_initialized(cpu);
+	if (ret)
+		goto out;
+
+	wait_cpu_callin(cpu);
+	wait_cpu_online(cpu);
+
+out:
+	/* Cleanup possible dangling ends... */
+	if (x86_platform.legacy.warm_reset)
+		smpboot_restore_warm_reset_vector();
+
+	return ret;
 }
 
 /**