
[XEN,9/9] x86/smp: start APs in parallel during boot

Message ID 77c9199eabf3a30ebcf89356b2dd35abd611a3a9.1699982111.git.krystian.hebel@3mdeb.com (mailing list archive)
State New, archived
Series x86: parallelize AP bring-up during boot

Commit Message

Krystian Hebel Nov. 14, 2023, 5:50 p.m. UTC
Multiple delays are required when sending IPIs and waiting for
responses. During boot, 4 such IPIs were sent for each AP. With this
change, only one set of broadcast IPIs is sent. This reduces boot time,
especially on platforms with a large number of cores.

Single-CPU initialization is still possible; it is used for hotplug.

During wakeup from S3, APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
properly testing it at the moment.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/include/asm/smp.h |  1 +
 xen/arch/x86/setup.c           |  2 +
 xen/arch/x86/smpboot.c         | 68 ++++++++++++++++++++++++----------
 3 files changed, 51 insertions(+), 20 deletions(-)

Comments

Jan Beulich Feb. 8, 2024, 12:37 p.m. UTC | #1
On 14.11.2023 18:50, Krystian Hebel wrote:
> Multiple delays are required when sending IPIs and waiting for
> responses. During boot, 4 such IPIs were sent per each AP. With this
> change, only one set of broadcast IPIs is sent. This reduces boot time,
> especially for platforms with large number of cores.

Yet APs do their startup work in parallel only for a brief period of
time, if I'm not mistaken. Otherwise I can't see why you'd still have
cpu_up() in __start_xen().

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>                  cpu_data[i].stack_base = cpu_alloc_stack(i);
>          }
>  
> +        smp_send_init_sipi_sipi_allbutself();
> +
>          for_each_present_cpu ( i )
>          {
>              if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&

So what about constraints on the number of CPUs to use? In such a case
you shouldn't send the IPI to all of them, at least if they're not
meant to be parked.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>  
>  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>  {
> -    unsigned long send_status = 0, accept_status = 0;
> +    unsigned long send_status = 0, accept_status = 0, sh = 0;

sh doesn't need to be 64 bits wide, does it?

>      int maxlvt, timeout, i;
>  
>      /*
> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>      if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>          return 0;
>  
> +    /*
> +     * Use destination shorthand for broadcasting IPIs during boot.
> +     */

Nit (style): This is a single line comment.

> +    if ( phys_apicid == BAD_APICID )
> +        sh = APIC_DEST_ALLBUT;

I think the latest for this the function parameter wants changing to
unsigned int (in another prereq patch).

> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>       */
>      mtrr_save_state();
>  
> -    start_eip = bootsym_phys(trampoline_realmode_entry);
> +    /* Check if AP is already up. */
> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
> +    {
> +        /* This grunge runs the startup process for the targeted processor. */
> +        unsigned long start_eip;
> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>  
> -    /* start_eip needs be page aligned, and below the 1M boundary. */
> -    if ( start_eip & ~0xff000 )
> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> +        /* start_eip needs be page aligned, and below the 1M boundary. */
> +        if ( start_eip & ~0xff000 )
> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);

Isn't this redundant now with the panic() in
smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
unconditionally?

> -    /* So we see what's up   */
> -    if ( opt_cpu_info )
> -        printk("Booting processor %d/%d eip %lx\n",
> -               cpu, apicid, start_eip);
> +        /* So we see what's up   */
> +        if ( opt_cpu_info )
> +            printk("AP trampoline at %lx\n", start_eip);

Why this change in log message? It makes messages for individual CPUs
indistinguishable. And like above it's redundant with what
smp_send_init_sipi_sipi_allbutself() logs.

> -    /* This grunge runs the startup process for the targeted processor. */
> +        /* mark "stuck" area as not stuck */
> +        bootsym(trampoline_cpu_started) = 0;
> +        smp_mb();
>  
> -    /* Starting actual IPI sequence... */
> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +        /* Starting actual IPI sequence... */
> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +    }
> +
> +    if ( opt_cpu_info )
> +        printk("Booting processor %d/%d\n", cpu, apicid);

Oh, here's the other half. Yet for above it still doesn't make sense
to issue the same message for all CPUs.

> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>          rc = -EIO;
>      }
>  
> -    /* mark "stuck" area as not stuck */
> -    bootsym(trampoline_cpu_started) = 0;
> -    smp_mb();

While you move this up, it's not clear to me how you would now
identify individual stuck CPUs. I would have expected that this is
another global that needs converting up front, to be per-CPU.

> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>      .notifier_call = cpu_smpboot_callback
>  };
>  
> +void smp_send_init_sipi_sipi_allbutself(void)

__init?

> +{
> +    unsigned long start_eip;
> +    start_eip = bootsym_phys(trampoline_realmode_entry);

This can be the initializer of the variable, which would then save
me from complaining about the missing blank line between declaration
and statement(s). (Actually, as I notice only now - same for code you
move around in do_boot_cpu().)

Jan
Krystian Hebel March 12, 2024, 5:13 p.m. UTC | #2
On 8.02.2024 13:37, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> Multiple delays are required when sending IPIs and waiting for
>> responses. During boot, 4 such IPIs were sent per each AP. With this
>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>> especially for platforms with large number of cores.
> Yet APs do their startup work in parallel only for a brief period of
> time, if I'm not mistaken. Otherwise I can't see why you'd still have
> cpu_up() in __start_xen().
cpu_up() is left because multiple notifiers aren't easy to convert to work
in parallel. In terms of lines of code it looks like a brief period, but all
the delays along the way were taking much more time than the actual
work. As the gain was already more than what I hoped for, I decided
against spending too much time trying to fix the notifiers' code for
minimal profit.
>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>                   cpu_data[i].stack_base = cpu_alloc_stack(i);
>>           }
>>   
>> +        smp_send_init_sipi_sipi_allbutself();
>> +
>>           for_each_present_cpu ( i )
>>           {
>>               if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
> So what about constraints on the number of CPUs to use? In such a case
> you shouldn't send the IPI to all of them, at least if they're not
> meant to be parked.
Fair point, such a check can easily be added before broadcasting, and the
rest of the code should already be able to handle this.
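
For illustration, a minimal sketch of such a guard at the call site in
__start_xen() might look as follows (not part of the posted patch; it
assumes num_present_cpus() for the count of present CPUs, and keeps the
existing per-CPU wakeup path for any constrained case):

    /*
     * Hypothetical guard, illustrative only: broadcast INIT-SIPI-SIPI
     * only when every present CPU is allowed to come online (or will at
     * least be parked); otherwise rely on the per-CPU wakeup done later
     * via cpu_up() in the for_each_present_cpu() loop.
     */
    if ( park_offline_cpus || num_present_cpus() <= max_cpus )
        smp_send_init_sipi_sipi_allbutself();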
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>   
>>   static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>   {
>> -    unsigned long send_status = 0, accept_status = 0;
>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
> sh doesn't need to be 64 bits wide, does it?
No, will change.
>
>>       int maxlvt, timeout, i;
>>   
>>       /*
>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>       if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>           return 0;
>>   
>> +    /*
>> +     * Use destination shorthand for broadcasting IPIs during boot.
>> +     */
> Nit (style): This is a single line comment.
Ack
>
>> +    if ( phys_apicid == BAD_APICID )
>> +        sh = APIC_DEST_ALLBUT;
> I think the latest for this the function parameter wants changing to
> unsigned int (in another prereq patch).
What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
as a signed int all the way from __cpu_up(); should I change all of those
to unsigned?
>
>> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>>        */
>>       mtrr_save_state();
>>   
>> -    start_eip = bootsym_phys(trampoline_realmode_entry);
>> +    /* Check if AP is already up. */
>> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
>> +    {
>> +        /* This grunge runs the startup process for the targeted processor. */
>> +        unsigned long start_eip;
>> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>>   
>> -    /* start_eip needs be page aligned, and below the 1M boundary. */
>> -    if ( start_eip & ~0xff000 )
>> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
>> +        /* start_eip needs be page aligned, and below the 1M boundary. */
>> +        if ( start_eip & ~0xff000 )
>> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> Isn't this redundant now with the panic() in
> smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
> unconditionally.
It won't be running unconditionally, but it also wouldn't be redundant in
the case of hot-plugging.
>
>> -    /* So we see what's up   */
>> -    if ( opt_cpu_info )
>> -        printk("Booting processor %d/%d eip %lx\n",
>> -               cpu, apicid, start_eip);
>> +        /* So we see what's up   */
>> +        if ( opt_cpu_info )
>> +            printk("AP trampoline at %lx\n", start_eip);
> Why this change in log message? It makes messages for individual CPUs
> indistinguishable. And like above it's redundant with what
> smp_send_init_sipi_sipi_allbutself() logs.
>
>> -    /* This grunge runs the startup process for the targeted processor. */
>> +        /* mark "stuck" area as not stuck */
>> +        bootsym(trampoline_cpu_started) = 0;
>> +        smp_mb();
>>   
>> -    /* Starting actual IPI sequence... */
>> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +        /* Starting actual IPI sequence... */
>> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +    }
>> +
>> +    if ( opt_cpu_info )
>> +        printk("Booting processor %d/%d\n", cpu, apicid);
> Oh, here's the other half. Yet for above it still doesn't make sense
> to issue the same message for all CPUs.
I'll undo it. It was important at one point for debugging, but I agree
that it doesn't make sense now.
>
>> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>           rc = -EIO;
>>       }
>>   
>> -    /* mark "stuck" area as not stuck */
>> -    bootsym(trampoline_cpu_started) = 0;
>> -    smp_mb();
> While you move this up, it's not clear to me how you would now
> identify individual stuck CPUs. I would have expected that this is
> another global that needs converting up front, to be per-CPU.
In the existing code this is set very early, in 16-bit code, and the variable
is located within the first page of the trampoline. With this change it is
impossible to identify individual stuck CPUs.

I was considering removing this variable altogether. Another option
would be to make it an array with NR_CPUS elements, but this may
get too big. It would be possible to fill it relatively early, after the
CPU ID is obtained, before paging is enabled, but after loading the IDT
and GDT and jumping to protected mode. Those are things that can break
due to errors in the code, but it may be better than not having that
info at all.

It could also be set from 64-bit code; that way simple per-CPU data would
work. It is a bit late, but this would probably be the easiest and
cleanest to write.
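
A rough sketch of that 64-bit variant (illustrative only, not part of the
posted patch; the flag name and exact placement are assumptions):

    /* Hypothetical per-CPU flag, set once an AP reaches 64-bit code. */
    static DEFINE_PER_CPU(bool, ap_reached_64bit);

    /* Early in start_secondary(cpu), before anything likely to hang: */
    per_cpu(ap_reached_64bit, cpu) = true;
    smp_wmb();

    /* In do_boot_cpu()'s timeout path, instead of trampoline_cpu_started: */
    if ( !per_cpu(ap_reached_64bit, cpu) )
        printk("CPU%d never reached 64-bit code\n", cpu);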

>
>> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>>       .notifier_call = cpu_smpboot_callback
>>   };
>>   
>> +void smp_send_init_sipi_sipi_allbutself(void)
> __init?
Ack
>
>> +{
>> +    unsigned long start_eip;
>> +    start_eip = bootsym_phys(trampoline_realmode_entry);
> This can be the initializer of the variable, which would then save
> me from complaining about the missing blank line between declaration
> and statement(s). (Actually, as I notice only now - same for code you
> move around in do_boot_cpu().)
Will do. It may still be split due to line length, but at least that will
follow the coding style.
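
I.e. something along these lines (a style sketch only; whether it actually
needs wrapping depends on the surrounding indentation):

    unsigned long start_eip =
        bootsym_phys(trampoline_realmode_entry);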
>
> Jan
Best regards,
Jan Beulich March 13, 2024, 1:30 p.m. UTC | #3
On 12.03.2024 18:13, Krystian Hebel wrote:
> 
> On 8.02.2024 13:37, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> Multiple delays are required when sending IPIs and waiting for
>>> responses. During boot, 4 such IPIs were sent per each AP. With this
>>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>>> especially for platforms with large number of cores.
>> Yet APs do their startup work in parallel only for a brief period of
>> time, if I'm not mistaken. Otherwise I can't see why you'd still have
>> cpu_up() in __start_xen().
> cpu_up() is left because multiple notifiers aren't easy to convert to work
> in parallel. In terms of lines of code it looks like a brief period, but all
> the delays along the way were taking much more time than the actual
> work. As the gain was already more than what I hoped for, I decided
> against spending too much time trying to fix the notifiers' code for
> minimal profit.

Which is all fine. It's just that, by the title of this patch and the cover
letter, I expected more. Adding "partly" or some such in both places may help.

>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>>   
>>>   static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>   {
>>> -    unsigned long send_status = 0, accept_status = 0;
>>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
>> sh doesn't need to be 64 bits wide, does it?
> No, will change.
>>
>>>       int maxlvt, timeout, i;
>>>   
>>>       /*
>>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>       if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>>           return 0;
>>>   
>>> +    /*
>>> +     * Use destination shorthand for broadcasting IPIs during boot.
>>> +     */
>> Nit (style): This is a single line comment.
> Ack
>>
>>> +    if ( phys_apicid == BAD_APICID )
>>> +        sh = APIC_DEST_ALLBUT;
>> I think the latest for this the function parameter wants changing to
>> unsigned int (in another prereq patch).
> What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
> as signed int since __cpu_up(), should I change all of those to unsigned?

That would be best, yes. BAD_APICID, after all, is an unsigned constant
(no matter that its definition involves a unary minus operator).
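
For reference, the prereq change being discussed would roughly amount to the
following (a sketch; callers such as __cpu_up() and do_boot_cpu() would need
matching adjustments):

    static int wakeup_secondary_cpu(unsigned int phys_apicid,
                                    unsigned long start_eip);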

Jan

Patch

diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index 98739028a6ed..6ca0158a368d 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -31,6 +31,7 @@  DECLARE_PER_CPU(cpumask_var_t, send_ipi_cpumask);
 extern bool park_offline_cpus;
 
 void smp_send_nmi_allbutself(void);
+void smp_send_init_sipi_sipi_allbutself(void);
 
 void send_IPI_mask(const cpumask_t *, int vector);
 void send_IPI_self(int vector);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b2c0679725ea..42a9067b81eb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1963,6 +1963,8 @@  void __init noreturn __start_xen(unsigned long mbi_p)
                 cpu_data[i].stack_base = cpu_alloc_stack(i);
         }
 
+        smp_send_init_sipi_sipi_allbutself();
+
         for_each_present_cpu ( i )
         {
             if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index cbea2d45f70d..e9a7f78a5a6f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -425,7 +425,7 @@  void start_secondary(unsigned int cpu)
 
 static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 {
-    unsigned long send_status = 0, accept_status = 0;
+    unsigned long send_status = 0, accept_status = 0, sh = 0;
     int maxlvt, timeout, i;
 
     /*
@@ -445,6 +445,12 @@  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
     if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
         return 0;
 
+    /*
+     * Use destination shorthand for broadcasting IPIs during boot.
+     */
+    if ( phys_apicid == BAD_APICID )
+        sh = APIC_DEST_ALLBUT;
+
     /*
      * Be paranoid about clearing APIC errors.
      */
@@ -458,7 +464,7 @@  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
         /*
          * Turn INIT on target chip via IPI
          */
-        apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
+        apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT | sh,
                        phys_apicid);
 
         if ( !x2apic_enabled )
@@ -475,7 +481,7 @@  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 
             Dprintk("Deasserting INIT.\n");
 
-            apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+            apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT | sh, phys_apicid);
 
             Dprintk("Waiting for send to finish...\n");
             timeout = 0;
@@ -512,7 +518,7 @@  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
          * STARTUP IPI
          * Boot on the stack
          */
-        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12), phys_apicid);
+        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12) | sh, phys_apicid);
 
         if ( !x2apic_enabled )
         {
@@ -565,7 +571,6 @@  int alloc_cpu_id(void)
 static int do_boot_cpu(int apicid, int cpu)
 {
     int timeout, boot_error = 0, rc = 0;
-    unsigned long start_eip;
 
     /*
      * Save current MTRR state in case it was changed since early boot
@@ -573,21 +578,31 @@  static int do_boot_cpu(int apicid, int cpu)
      */
     mtrr_save_state();
 
-    start_eip = bootsym_phys(trampoline_realmode_entry);
+    /* Check if AP is already up. */
+    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
+    {
+        /* This grunge runs the startup process for the targeted processor. */
+        unsigned long start_eip;
+        start_eip = bootsym_phys(trampoline_realmode_entry);
 
-    /* start_eip needs be page aligned, and below the 1M boundary. */
-    if ( start_eip & ~0xff000 )
-        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+        /* start_eip needs be page aligned, and below the 1M boundary. */
+        if ( start_eip & ~0xff000 )
+            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
 
-    /* So we see what's up   */
-    if ( opt_cpu_info )
-        printk("Booting processor %d/%d eip %lx\n",
-               cpu, apicid, start_eip);
+        /* So we see what's up   */
+        if ( opt_cpu_info )
+            printk("AP trampoline at %lx\n", start_eip);
 
-    /* This grunge runs the startup process for the targeted processor. */
+        /* mark "stuck" area as not stuck */
+        bootsym(trampoline_cpu_started) = 0;
+        smp_mb();
 
-    /* Starting actual IPI sequence... */
-    boot_error = wakeup_secondary_cpu(apicid, start_eip);
+        /* Starting actual IPI sequence... */
+        boot_error = wakeup_secondary_cpu(apicid, start_eip);
+    }
+
+    if ( opt_cpu_info )
+        printk("Booting processor %d/%d\n", cpu, apicid);
 
     if ( !boot_error )
     {
@@ -646,10 +661,6 @@  static int do_boot_cpu(int apicid, int cpu)
         rc = -EIO;
     }
 
-    /* mark "stuck" area as not stuck */
-    bootsym(trampoline_cpu_started) = 0;
-    smp_mb();
-
     return rc;
 }
 
@@ -1155,6 +1166,23 @@  static struct notifier_block cpu_smpboot_nfb = {
     .notifier_call = cpu_smpboot_callback
 };
 
+void smp_send_init_sipi_sipi_allbutself(void)
+{
+    unsigned long start_eip;
+    start_eip = bootsym_phys(trampoline_realmode_entry);
+
+    /* start_eip needs be page aligned, and below the 1M boundary. */
+    if ( start_eip & ~0xff000 )
+        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+
+    /* So we see what's up   */
+    if ( opt_cpu_info )
+        printk("Booting APs in parallel, eip %lx\n", start_eip);
+
+    /* Starting actual broadcast IPI sequence... */
+    wakeup_secondary_cpu(BAD_APICID, start_eip);
+}
+
 void __init smp_prepare_cpus(void)
 {
     register_cpu_notifier(&cpu_smpboot_nfb);