[0/9] Parallel CPU bringup for x86_64

Message ID 20230201204338.1337562-1-usama.arif@bytedance.com (mailing list archive)

Usama Arif Feb. 1, 2023, 8:43 p.m. UTC
This patch series is based on the work done by David Woodhouse (v4: https://lore.kernel.org/all/20220201205328.123066-1-dwmw2@infradead.org/).
Parallel CPU bringup is disabled for all AMD CPUs in this version (see discussions: https://lore.kernel.org/all/bc3f2b1332c4bb77558df8aa36493a55542fe5b9.camel@infradead.org/ and
https://lore.kernel.org/all/3b6ac86fdc800cac5806433daf14a9095be101e9.camel@infradead.org/).

Doing INIT/SIPI/SIPI in parallel brings down the time for smpboot from ~700ms
to 100ms (85% improvement) on a server with 128 CPUs split across 2 NUMA
nodes.
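
As a rough illustration of the numbers above, the sketch below checks the
percentage and models why overlapping the wake-up latency helps. The function
names and the kick_us/wake_us parameters are made up for this example, and the
model is deliberately simplified: it assumes every AP costs the same and that
wake-up latencies overlap fully, which is not something measured in this series.

```c
#include <assert.h>

/* Percentage reduction going from before_ms to after_ms. */
double improvement_percent(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return (before_ms - after_ms) / before_ms * 100.0;
}

/*
 * Toy model (an assumption, not a measurement): sending INIT/SIPI/SIPI to
 * one AP costs kick_us, and the AP then needs wake_us to come up.
 */

/* Serial: kick one AP and wait for it before touching the next one. */
double serial_bringup_us(unsigned int nr_aps, double kick_us, double wake_us)
{
	return nr_aps * (kick_us + wake_us);
}

/* Parallel: kick every AP first, so the wake-up latencies overlap. */
double parallel_bringup_us(unsigned int nr_aps, double kick_us, double wake_us)
{
	if (nr_aps == 0)
		return 0.0;
	/* The last AP is kicked at nr_aps * kick_us and is up wake_us later. */
	return nr_aps * kick_us + wake_us;
}
```

With the figures quoted above, improvement_percent(700, 100) comes out at a
little under 86%, consistent with the stated 85%.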

Adding another cpuhp state for do_wait_cpu_initialized, so that cpu_init
is also reached in parallel (as David proposed in v1), will bring it down
further to ~30ms. That change depends on this patch series, so it could be
explored if this gets merged.
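
The idea of moving one more step into the parallel phase can be sketched in
userspace C as below. This is not the kernel's cpuhp code: the step names,
struct bringup_log and bringup_staged() are hypothetical and only record the
order in which steps would run. Steps below nr_parallel_steps are run across
all secondary CPUs in one pass each; the remaining steps stay serialized per
CPU.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 64

/* Hypothetical step identifiers; not the kernel's cpuhp state names. */
enum step { STEP_KICK, STEP_WAIT_INIT, STEP_ONLINE, NR_STEPS };

struct event {
	enum step step;
	unsigned int cpu;
};

/* Records the order in which (step, cpu) pairs are executed. */
struct bringup_log {
	struct event ev[MAX_LOG];
	size_t len;
};

static void record(struct bringup_log *log, enum step step, unsigned int cpu)
{
	assert(log->len < MAX_LOG);
	log->ev[log->len].step = step;
	log->ev[log->len].cpu = cpu;
	log->len++;
}

/*
 * CPU 0 is the boot CPU, so bringup starts at CPU 1.
 * nr_parallel_steps == 0 gives the fully serial order,
 * 1 runs only STEP_KICK for all CPUs up front,
 * 2 additionally runs STEP_WAIT_INIT for all CPUs up front.
 */
void bringup_staged(struct bringup_log *log, unsigned int nr_cpus,
		    int nr_parallel_steps)
{
	/* One pass over all secondary CPUs per parallel step. */
	for (int step = 0; step < nr_parallel_steps; step++)
		for (unsigned int cpu = 1; cpu < nr_cpus; cpu++)
			record(log, (enum step)step, cpu);

	/* The remaining steps are completed one CPU at a time. */
	for (unsigned int cpu = 1; cpu < nr_cpus; cpu++)
		for (int step = nr_parallel_steps; step < NR_STEPS; step++)
			record(log, (enum step)step, cpu);
}
```

In this sketch, going from nr_parallel_steps = 1 to 2 corresponds to the extra
state described above; whether that pays off on real hardware is a separate
question that only measurement can answer.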

Changes across versions:
v2: Cut it back to just INIT/SIPI/SIPI in parallel for now, nothing more
v3: Clean up x2apic patch, add MTRR optimisation, lock topology update
    in preparation for more parallelisation.
v4: Fixes to the real mode parallelisation patch spotted by SeanC, to
    avoid scribbling on initial_gs in common_cpu_up(), and to allow all
    24 bits of the physical X2APIC ID to be used. That patch still needs
    a Signed-off-by from its original author, who once claimed not to
    remember writing it at all. But now we've fixed it, hopefully he'll
    admit it now :)
v5: rebase to v6.1 and remeasure performance, disable parallel bringup
    for AMD CPUs.

David Woodhouse (8):
  x86/apic/x2apic: Fix parallel handling of cluster_mask
  cpu/hotplug: Move idle_thread_get() to <linux/smpboot.h>
  cpu/hotplug: Add dynamic parallel bringup states before
    CPUHP_BRINGUP_CPU
  x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
  x86/smpboot: Split up native_cpu_up into separate phases and document
    them
  x86/smpboot: Send INIT/SIPI/SIPI to secondary CPUs in parallel
  x86/mtrr: Avoid repeated save of MTRRs on boot-time CPU bringup
  x86/smpboot: Serialize topology updates for secondary bringup

Thomas Gleixner (1):
  x86/smpboot: Support parallel startup of secondary CPUs

 arch/x86/include/asm/realmode.h       |   3 +
 arch/x86/include/asm/smp.h            |  13 +-
 arch/x86/include/asm/topology.h       |   2 -
 arch/x86/kernel/acpi/sleep.c          |   1 +
 arch/x86/kernel/apic/apic.c           |   2 +-
 arch/x86/kernel/apic/x2apic_cluster.c | 108 +++++----
 arch/x86/kernel/cpu/common.c          |   6 +-
 arch/x86/kernel/cpu/mtrr/mtrr.c       |   9 +
 arch/x86/kernel/head_64.S             |  73 ++++++
 arch/x86/kernel/smpboot.c             | 324 ++++++++++++++++++--------
 arch/x86/realmode/init.c              |   3 +
 arch/x86/realmode/rm/trampoline_64.S  |  14 ++
 arch/x86/xen/smp_pv.c                 |   4 +-
 include/linux/cpuhotplug.h            |   2 +
 include/linux/smpboot.h               |   7 +
 kernel/cpu.c                          |  27 ++-
 kernel/smpboot.c                      |   2 +-
 kernel/smpboot.h                      |   2 -
 18 files changed, 443 insertions(+), 159 deletions(-)

Comments

David Woodhouse Feb. 2, 2023, 10:02 a.m. UTC | #1
Thanks, Usama.

I've updated to v6.2-rc6 since there were a few more tweaks required
(and we should double-check that the new handling of cache_ap_init from
a dedicated cpuhp step works right if that ends up being done in
parallel).

I also fixed up the complaints from the test robot: including
<linux/smpboot.h> from smpboot.c, making do_cpu_up() static, and
putting #ifdef CONFIG_SMP around the 'are we booting the AP?' check and
code segment in head_64.S.

I've made the AMD thing a CPU bug as Peter suggested, and pushed it to
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel-6.2-rc6
for you to do the real work of actually testing it :)

Usama Arif Feb. 2, 2023, 9:59 p.m. UTC | #2
Thanks David! I have tested and reposted the v6.2-rc6 patches. One thing
I was mistaken about, since I had rebased the patches together, was that
the last 100ms to 30ms optimization was coming from parallelization in
x86/cpu:wait-init, when it seems to have a negligible effect. The last
70ms optimization was coming mainly from reusing timer calibration. It's
a simple patch and I have added it at the end of the series. The only
thing that was missing was a sign-off from the author, whom I have added
to the latest series.
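
The timer calibration patch itself is not shown in this thread, so the
following is only a guess at the general idea: run the expensive calibration
once and reuse the result for later CPUs. All names here (struct calib_cache,
slow_calibrate(), calibrate_cpu()) and the returned value are hypothetical,
and the sketch assumes every CPU counts at the same rate, which the real
patch may or may not require.

```c
#include <assert.h>
#include <stdbool.h>

/* Cached result of the first calibration, plus a counter for illustration. */
struct calib_cache {
	bool valid;
	unsigned long loops_per_jiffy;
	unsigned int calibrations;	/* how many times the slow path ran */
};

/* Hypothetical stand-in for an expensive delay-loop calibration. */
static unsigned long slow_calibrate(struct calib_cache *cache)
{
	cache->calibrations++;
	return 4000000UL;		/* placeholder value, not a real measurement */
}

/*
 * Return the calibration value for a CPU, running the slow path only for
 * the first caller. Assumes all CPUs share the same timer frequency.
 */
unsigned long calibrate_cpu(struct calib_cache *cache, unsigned int cpu)
{
	(void)cpu;			/* unused: every CPU gets the cached value */
	assert(cache);

	if (!cache->valid) {
		cache->loops_per_jiffy = slow_calibrate(cache);
		cache->valid = true;
	}
	return cache->loops_per_jiffy;
}
```

If each calibration costs a fixed amount of time per CPU, skipping it for all
but the first CPU removes that cost from the serialized part of bringup, which
would fit the saving described above.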