
[v7,08/10] x86/microcode: Synchronize late microcode loading

Message ID: 1558945891-3015-9-git-send-email-chao.gao@intel.com
State: Superseded
Series: improve late microcode loading

Commit Message

Chao Gao May 27, 2019, 8:31 a.m. UTC
This patch ports microcode improvement patches from linux kernel.

Before you read any further: the early loading method is still the
preferred one and you should always do that. The following patch is
improving the late loading mechanism for long running jobs and cloud use
cases.

Gather all cores and serialize the microcode update on them by doing it
one-by-one to make the late update process as reliable as possible and
avoid potential issues caused by the microcode update.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
[linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
[linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes in v7:
 - Check whether 'timeout' is 0 rather than "<=0" since it is unsigned int.
 - reword the comment above microcode_update_cpu() to clearly state that
 one thread per core should do the update.

Changes in v6:
 - Use one timeout period for rendezvous stage and another for update stage.
 - scale time to wait by the number of remaining cpus to respond.
   It helps to find something wrong earlier and thus we can reboot the
   system earlier.
---
 xen/arch/x86/microcode.c | 171 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 155 insertions(+), 16 deletions(-)

Comments

Jan Beulich June 5, 2019, 2:09 p.m. UTC | #1
>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
> This patch ports microcode improvement patches from linux kernel.
> 
> Before you read any further: the early loading method is still the
> preferred one and you should always do that. The following patch is
> improving the late loading mechanism for long running jobs and cloud use
> cases.
> 
> Gather all cores and serialize the microcode update on them by doing it
> one-by-one to make the late update process as reliable as possible and
> avoid potential issues caused by the microcode update.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Tested-by: Chao Gao <chao.gao@intel.com>
> [linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
> [linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> ---
> Changes in v7:
>  - Check whether 'timeout' is 0 rather than "<=0" since it is unsigned int.
>  - reword the comment above microcode_update_cpu() to clearly state that
>  one thread per core should do the update.
> 
> Changes in v6:
>  - Use one timeout period for rendezvous stage and another for update stage.
>  - scale time to wait by the number of remaining cpus to respond.
>    It helps to find something wrong earlier and thus we can reboot the
>    system earlier.
> ---
>  xen/arch/x86/microcode.c | 171 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 155 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
> index 23cf550..f4a417e 100644
> --- a/xen/arch/x86/microcode.c
> +++ b/xen/arch/x86/microcode.c
> @@ -22,6 +22,7 @@
>   */
>  
>  #include <xen/cpu.h>
> +#include <xen/cpumask.h>

It seems vanishingly unlikely that you would need this explicit #include
here, but it certainly isn't wrong.

> @@ -270,31 +296,90 @@ bool microcode_update_cache(struct microcode_patch *patch)
>      return true;
>  }
>  
> -static long do_microcode_update(void *patch)
> +/* Wait for CPUs to rendezvous with a timeout (us) */
> +static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
> +                         unsigned int timeout)
>  {
> -    int error, cpu;
> -
> -    error = microcode_update_cpu(patch);
> -    if ( error )
> +    while ( atomic_read(cnt) < expect )
>      {
> -        microcode_ops->free_patch(microcode_cache);
> -        return error;
> +        if ( !timeout )
> +        {
> +            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
> +                   smp_processor_id());
> +            return -EBUSY;
> +        }
> +        udelay(1);
> +        timeout--;
>      }

There's no comment here and nothing in the description: I don't
recall clarification as to whether RDTSC is fine to be issued by a
thread when ucode is being updated by another thread on the
same core.

> +static int do_microcode_update(void *patch)
> +{
> +    unsigned int cpu = smp_processor_id();
> +    unsigned int cpu_nr = num_online_cpus();
> +    unsigned int finished;
> +    int ret;
> +    static bool error;
>  
> -    microcode_update_cache(patch);
> +    atomic_inc(&cpu_in);
> +    ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
> +    if ( ret )
> +        return ret;
>  
> -    return error;
> +    ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
> +    /*
> +     * Load microcode update on only one logical processor per core.
> +     * Here, among logical processors of a core, the one with the
> +     * lowest thread id is chosen to perform the loading.
> +     */
> +    if ( !ret && (cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu))) )

At the very least it's not obvious whether this hyper-threading-centric
view ("logical processor") also applies to AMD's compute unit model
(which reuses cpu_sibling_mask). It does, as the respective MSRs are
per-compute-unit rather than per-core, but I'd appreciate if the
wording could be adjusted to explicitly name both cases (multiple
threads per core and multiple cores per CU).

> +    {
> +        ret = microcode_ops->apply_microcode(patch);
> +        if ( !ret )
> +            atomic_inc(&cpu_updated);
> +    }
> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing

I'm struggling with the "increase": I don't see anything being increased
here. You simply use a larger timeout than above.

> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short
> +     */
> +    atomic_inc(&cpu_out);
> +    finished = atomic_read(&cpu_out);
> +    while ( !error && finished != cpu_nr )
> +    {
> +        /*
> +         * During each timeout interval, at least a CPU is expected to
> +         * finish its update. Otherwise, something goes wrong.
> +         */
> +        if ( wait_for_cpus(&cpu_out, finished + 1,
> +                           MICROCODE_UPDATE_TIMEOUT_US) && !error )
> +        {
> +            error = true;
> +            panic("Timeout when finishing updating microcode (finished %d/%d)",
> +                  finished, cpu_nr);

Why the setting of "error" when you panic anyway?

And please use format specifiers matching the types of the
further arguments (i.e. twice %u here, but please check other
code as well).

Furthermore (and I'm sure I've given this comment before) if
you really hit the limit, how many panic() invocations are there
going to be? You run this function on all CPUs after all.

On the whole, taking a 256-thread system as example, you
allow the whole process to take over 4 min without calling
panic(). Leaving aside guests, I don't think Xen itself would
survive this in all cases. We've found the need to process
softirqs with far smaller delays, in particular from key handlers
producing lots of output. At the very least there should be a
bold warning logged if the system had been in stop-machine
state for, say, longer than 100ms (value subject to discussion).
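
For reference, the arithmetic behind the "over 4 min" figure can be
sketched as below. The two timeout values are the ones from the patch;
the helper name is hypothetical, and the bound is only a rough one: it
assumes exactly one CPU bumps cpu_out just before each update interval
expires, and it ignores the time the first finisher spends applying
the update.

```c
#include <stdint.h>

/* Values taken from the patch under review. */
#define MICROCODE_CALLIN_TIMEOUT_US 30000u
#define MICROCODE_UPDATE_TIMEOUT_US 1000000u

/*
 * Hypothetical helper (not part of the patch): rough upper bound, in
 * microseconds, on the time spent in stop-machine context without
 * panic() being called.
 */
static uint64_t worst_case_stop_machine_us(unsigned int cpu_nr)
{
    if ( cpu_nr == 0 )
        return 0;

    /* cpu_out has to advance from 1 to cpu_nr: cpu_nr - 1 intervals. */
    return MICROCODE_CALLIN_TIMEOUT_US +
           (uint64_t)(cpu_nr - 1) * MICROCODE_UPDATE_TIMEOUT_US;
}
```

With cpu_nr = 256 this gives 255030000 us, i.e. roughly 4 min 15 s.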

> +        }
> +
> +        finished = atomic_read(&cpu_out);
> +    }
> +
> +    /*
> +     * Refresh CPU signature (revision) on threads which didn't call
> +     * apply_microcode().
> +     */
> +    if ( cpu != cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +        ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));

Another option would be for the CPU doing the update to simply
propagate the new value to all its siblings' cpu_sig values.
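
A minimal standalone sketch of that alternative, not Xen code: the
struct, the array and the plain bit mask are simplified stand-ins
(assumptions) for the per-CPU cpu_sig and cpu_sibling_mask used in the
patch, and the function name is hypothetical.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 8u

/* Simplified stand-in for the per-CPU struct cpu_signature. */
struct cpu_sig_sketch {
    uint32_t rev;
};

static struct cpu_sig_sketch cpu_sigs[NR_CPUS_SKETCH];

/*
 * Copy the revision recorded for the updating CPU to every CPU whose
 * bit is set in sibling_mask (bit n stands for CPU n). The updater's
 * own bit may be set as well; copying onto itself is harmless.
 */
static void propagate_rev_to_siblings(unsigned int updater,
                                      uint32_t sibling_mask)
{
    for ( unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++ )
        if ( sibling_mask & (UINT32_C(1) << cpu) )
            cpu_sigs[cpu].rev = cpu_sigs[updater].rev;
}
```

The siblings would then not need to re-read the revision themselves
after the rendezvous.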

> @@ -337,12 +429,59 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>          if ( patch )
>              microcode_ops->free_patch(patch);
>          ret = -EINVAL;
> -        goto free;
> +        goto put;
>      }
>  
> -    ret = continue_hypercall_on_cpu(cpumask_first(&cpu_online_map),
> -                                    do_microcode_update, patch);
> +    atomic_set(&cpu_in, 0);
> +    atomic_set(&cpu_out, 0);
> +    atomic_set(&cpu_updated, 0);
> +
> +    /* Calculate the number of online CPU core */
> +    nr_cores = 0;
> +    for_each_online_cpu(cpu)
> +        if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +            nr_cores++;
> +
> +    printk(XENLOG_INFO "%d cores are to update their microcode\n", nr_cores);
> +
> +    /*
> +     * We intend to disable interrupt for long time, which may lead to
> +     * watchdog timeout.
> +     */
> +    watchdog_disable();
> +    /*
> +     * Late loading dance. Why the heavy-handed stop_machine effort?
> +     *
> +     * - HT siblings must be idle and not execute other code while the other
> +     *   sibling is loading microcode in order to avoid any negative
> +     *   interactions cause by the loading.
> +     *
> +     * - In addition, microcode update on the cores must be serialized until
> +     *   this requirement can be relaxed in the future. Right now, this is
> +     *   conservative and good.
> +     */
> +    ret = stop_machine_run(do_microcode_update, patch, NR_CPUS);
> +    watchdog_enable();
> +
> +    if ( atomic_read(&cpu_updated) == nr_cores )
> +    {
> +        spin_lock(&microcode_mutex);
> +        microcode_update_cache(patch);
> +        spin_unlock(&microcode_mutex);
> +    }
> +    else if ( atomic_read(&cpu_updated) == 0 )
> +        microcode_ops->free_patch(patch);
> +    else
> +    {
> +        printk("Updating microcode succeeded on part of CPUs and failed on\n"
> +               "others due to an unknown reason. A system with different\n"
> +               "microcode revisions is considered unstable. Please reboot and\n"
> +               "do not load the microcode that triggers this warning\n");
> +        microcode_ops->free_patch(patch);
> +    }

As said on an earlier patch, I think the cache can be updated if at
least one CPU loaded the blob successfully. Additionally I'd like to
ask that you log the number of successfully updated cores. And
finally perhaps "differing" instead of "different" and omit "due to
an unknown reason"?
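
Put together, the requested behaviour might look roughly like the
standalone sketch below. The enum, the function name and the use of
printf() in place of printk() are assumptions made for illustration;
only the decision logic and the message wording follow the comments
above.

```c
#include <stdio.h>

enum cache_action { CACHE_KEEP_OLD, CACHE_STORE_NEW };

/* Hypothetical helper: decide what to do with the patch after the update. */
static enum cache_action finish_late_load(unsigned int updated,
                                          unsigned int nr_cores)
{
    /* Always report how many cores took the update. */
    printf("%u of %u cores updated their microcode\n", updated, nr_cores);

    /* Nobody loaded the blob: nothing worth caching. */
    if ( updated == 0 )
        return CACHE_KEEP_OLD;

    if ( updated != nr_cores )
        printf("WARNING: a system with differing microcode revisions is "
               "considered unstable; please reboot and do not load this "
               "microcode again\n");

    /* At least one core loaded the blob successfully. */
    return CACHE_STORE_NEW;
}
```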

Jan

Roger Pau Monne June 5, 2019, 2:42 p.m. UTC | #2
On Mon, May 27, 2019 at 04:31:29PM +0800, Chao Gao wrote:
> This patch ports microcode improvement patches from linux kernel.
> 
> Before you read any further: the early loading method is still the
> preferred one and you should always do that. The following patch is
> improving the late loading mechanism for long running jobs and cloud use
> cases.
> 
> Gather all cores and serialize the microcode update on them by doing it
> one-by-one to make the late update process as reliable as possible and
> avoid potential issues caused by the microcode update.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Tested-by: Chao Gao <chao.gao@intel.com>
> [linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
> [linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> ---
> Changes in v7:
>  - Check whether 'timeout' is 0 rather than "<=0" since it is unsigned int.
>  - reword the comment above microcode_update_cpu() to clearly state that
>  one thread per core should do the update.
> 
> Changes in v6:
>  - Use one timeout period for rendezvous stage and another for update stage.
>  - scale time to wait by the number of remaining cpus to respond.
>    It helps to find something wrong earlier and thus we can reboot the
>    system earlier.
> ---
>  xen/arch/x86/microcode.c | 171 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 155 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
> index 23cf550..f4a417e 100644
> --- a/xen/arch/x86/microcode.c
> +++ b/xen/arch/x86/microcode.c
> @@ -22,6 +22,7 @@
>   */
>  
>  #include <xen/cpu.h>
> +#include <xen/cpumask.h>
>  #include <xen/lib.h>
>  #include <xen/kernel.h>
>  #include <xen/init.h>
> @@ -30,15 +31,34 @@
>  #include <xen/smp.h>
>  #include <xen/softirq.h>
>  #include <xen/spinlock.h>
> +#include <xen/stop_machine.h>
>  #include <xen/tasklet.h>
>  #include <xen/guest_access.h>
>  #include <xen/earlycpio.h>
> +#include <xen/watchdog.h>
>  
> +#include <asm/delay.h>
>  #include <asm/msr.h>
>  #include <asm/processor.h>
>  #include <asm/setup.h>
>  #include <asm/microcode.h>
>  
> +/*
> + * Before performing a late microcode update on any thread, we
> + * rendezvous all cpus in stop_machine context. The timeout for
> + * waiting for cpu rendezvous is 30ms. It is the timeout used by
> + * live patching
> + */
> +#define MICROCODE_CALLIN_TIMEOUT_US 30000
> +
> +/*
> + * Timeout for each thread to complete update is set to 1s. It is a
> + * conservative choice considering all possible interference (for
> + * instance, sometimes wbinvd takes relative long time). And a perfect
> + * timeout doesn't help a lot except an early shutdown.

I would remove the "And a perfect..." sentence. I don't think it makes
much sense to speak about "perfect timeouts".

> + */
> +#define MICROCODE_UPDATE_TIMEOUT_US 1000000
> +
>  static module_t __initdata ucode_mod;
>  static signed int __initdata ucode_mod_idx;
>  static bool_t __initdata ucode_mod_forced;
> @@ -190,6 +210,12 @@ static DEFINE_SPINLOCK(microcode_mutex);
>  DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
>  
>  /*
> + * Count the CPUs that have entered, exited the rendezvous and succeeded in
> + * microcode update during late microcode update respectively.
> + */
> +static atomic_t cpu_in, cpu_out, cpu_updated;
> +
> +/*
>   * Return the patch with the highest revision id among all matching
>   * patches in the blob. Return NULL if no suitable patch.
>   */
> @@ -270,31 +296,90 @@ bool microcode_update_cache(struct microcode_patch *patch)
>      return true;
>  }
>  
> -static long do_microcode_update(void *patch)
> +/* Wait for CPUs to rendezvous with a timeout (us) */
> +static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
> +                         unsigned int timeout)
>  {
> -    int error, cpu;
> -
> -    error = microcode_update_cpu(patch);
> -    if ( error )
> +    while ( atomic_read(cnt) < expect )
>      {
> -        microcode_ops->free_patch(microcode_cache);
> -        return error;
> +        if ( !timeout )
> +        {
> +            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
> +                   smp_processor_id());
> +            return -EBUSY;
> +        }
> +        udelay(1);
> +        timeout--;

Nit: you could do the decrement inside the if condition.
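
Something along these lines, as a standalone sketch rather than the
Xen function: C11 atomics and a counting stub stand in for Xen's
atomic_t and udelay(), and all names are made up for the example.

```c
#include <stdatomic.h>

/* Stub standing in for udelay(1); it only counts how often it ran. */
static unsigned int delay_calls;

static void delay_one_us(void)
{
    delay_calls++;
}

/* Wait until *cnt reaches expect, giving up after timeout delay steps. */
static int wait_for_count(atomic_uint *cnt, unsigned int expect,
                          unsigned int timeout)
{
    while ( atomic_load(cnt) < expect )
    {
        /* Test for zero first, then decrement, in one expression. */
        if ( !timeout-- )
            return -1;
        delay_one_us();
    }

    return 0;
}
```

The number of delay steps before giving up is the same as with the
separate decrement; on the failure path timeout wraps around, which is
harmless because the function returns right away.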

>      }
>  
> +    return 0;
> +}
>  
> -    cpu = cpumask_next(smp_processor_id(), &cpu_online_map);
> -    if ( cpu < nr_cpu_ids )
> -        return continue_hypercall_on_cpu(cpu, do_microcode_update, patch);
> +static int do_microcode_update(void *patch)
> +{
> +    unsigned int cpu = smp_processor_id();
> +    unsigned int cpu_nr = num_online_cpus();
> +    unsigned int finished;
> +    int ret;
> +    static bool error;
>  
> -    microcode_update_cache(patch);
> +    atomic_inc(&cpu_in);
> +    ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
> +    if ( ret )
> +        return ret;
>  
> -    return error;
> +    ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
> +    /*
> +     * Load microcode update on only one logical processor per core.
> +     * Here, among logical processors of a core, the one with the
> +     * lowest thread id is chosen to perform the loading.
> +     */
> +    if ( !ret && (cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu))) )
> +    {
> +        ret = microcode_ops->apply_microcode(patch);
> +        if ( !ret )
> +            atomic_inc(&cpu_updated);
> +    }
> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing
> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short

It's likely me missing something, but where is this serialization
being done?

I assume it's done by apply_microcode because do_microcode_update
doesn't do any serialization of microcode loading.

> +     */
> +    atomic_inc(&cpu_out);
> +    finished = atomic_read(&cpu_out);
> +    while ( !error && finished != cpu_nr )
> +    {
> +        /*
> +         * During each timeout interval, at least a CPU is expected to
> +         * finish its update. Otherwise, something goes wrong.
> +         */
> +        if ( wait_for_cpus(&cpu_out, finished + 1,
> +                           MICROCODE_UPDATE_TIMEOUT_US) && !error )
> +        {
> +            error = true;

I'm not sure I see the point of the error variable, you already bring
the system down with panic. If the intention is to prevent multiple
panics from different threads then you need to use some kind of
atomic fetch and set or else the code is racy.
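
To illustrate what I mean, here is a standalone sketch using C11
atomics; in Xen this role would fall to one of the hypervisor's own
atomic helpers, and the names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag timeout_reported = ATOMIC_FLAG_INIT;

/*
 * Returns true for exactly one caller and false for all later ones,
 * even when several CPUs race here, because the test and the set
 * happen as a single atomic operation.
 */
static bool claim_timeout_report(void)
{
    return !atomic_flag_test_and_set(&timeout_reported);
}
```

Only the CPU for which this returns true would then go on to call
panic().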

> +            panic("Timeout when finishing updating microcode (finished %d/%d)",

Both finished and cpu_nr are unsigned ints, hence you should use %u
instead of %d.

> +                  finished, cpu_nr);
> +        }

This whole loop seems to be designed for serialized microcode
application, which is not the case with the current implementation
where microcode is updated in parallel on all the cores?

IMO you should just wait for MICROCODE_UPDATE_TIMEOUT_US a single
time.

> +        finished = atomic_read(&cpu_out);
> +    }
> +
> +    /*
> +     * Refresh CPU signature (revision) on threads which didn't call
> +     * apply_microcode().
> +     */
> +    if ( cpu != cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +        ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
> +
> +    return ret;
>  }
>  
>  int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>  {
>      int ret;
>      void *buffer;
> +    unsigned int cpu, nr_cores;
>      struct microcode_patch *patch;
>  
>      if ( len != (uint32_t)len )
> @@ -316,11 +401,18 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>          goto free;
>      }
>  
> +    /* cpu_online_map must not change during update */
> +    if ( !get_cpu_maps() )
> +    {
> +        ret = -EBUSY;
> +        goto free;
> +    }
> +
>      if ( microcode_ops->start_update )
>      {
>          ret = microcode_ops->start_update();
>          if ( ret != 0 )
> -            goto free;
> +            goto put;
>      }
>  
>      patch = microcode_parse_blob(buffer, len);
> @@ -337,12 +429,59 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>          if ( patch )
>              microcode_ops->free_patch(patch);
>          ret = -EINVAL;
> -        goto free;
> +        goto put;
>      }
>  
> -    ret = continue_hypercall_on_cpu(cpumask_first(&cpu_online_map),
> -                                    do_microcode_update, patch);
> +    atomic_set(&cpu_in, 0);
> +    atomic_set(&cpu_out, 0);
> +    atomic_set(&cpu_updated, 0);
> +
> +    /* Calculate the number of online CPU core */
> +    nr_cores = 0;
> +    for_each_online_cpu(cpu)
> +        if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +            nr_cores++;
> +
> +    printk(XENLOG_INFO "%d cores are to update their microcode\n", nr_cores);

Same here, nr_cores is unsigned.

> +
> +    /*
> +     * We intend to disable interrupt for long time, which may lead to
> +     * watchdog timeout.
> +     */
> +    watchdog_disable();
> +    /*
> +     * Late loading dance. Why the heavy-handed stop_machine effort?
> +     *
> +     * - HT siblings must be idle and not execute other code while the other
> +     *   sibling is loading microcode in order to avoid any negative
> +     *   interactions cause by the loading.
> +     *
> +     * - In addition, microcode update on the cores must be serialized until
> +     *   this requirement can be relaxed in the future. Right now, this is
> +     *   conservative and good.
> +     */
> +    ret = stop_machine_run(do_microcode_update, patch, NR_CPUS);
> +    watchdog_enable();
> +
> +    if ( atomic_read(&cpu_updated) == nr_cores )
> +    {
> +        spin_lock(&microcode_mutex);
> +        microcode_update_cache(patch);
> +        spin_unlock(&microcode_mutex);
> +    }
> +    else if ( atomic_read(&cpu_updated) == 0 )
> +        microcode_ops->free_patch(patch);
> +    else
> +    {
> +        printk("Updating microcode succeeded on part of CPUs and failed on\n"

I would prefix this with XENLOG_ERR and an explicit "ERROR: " prefix
in the format string.

Thanks, Roger.

Chao Gao June 11, 2019, 12:36 p.m. UTC | #3
On Wed, Jun 05, 2019 at 08:09:43AM -0600, Jan Beulich wrote:
>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>> This patch ports microcode improvement patches from linux kernel.
>> 
>> Before you read any further: the early loading method is still the
>> preferred one and you should always do that. The following patch is
>> improving the late loading mechanism for long running jobs and cloud use
>> cases.
>> 
>> Gather all cores and serialize the microcode update on them by doing it
>> one-by-one to make the late update process as reliable as possible and
>> avoid potential issues caused by the microcode update.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> Tested-by: Chao Gao <chao.gao@intel.com>
>> [linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
>> [linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> ---
>> Changes in v7:
>>  - Check whether 'timeout' is 0 rather than "<=0" since it is unsigned int.
>>  - reword the comment above microcode_update_cpu() to clearly state that
>>  one thread per core should do the update.
>> 
>> Changes in v6:
>>  - Use one timeout period for rendezvous stage and another for update stage.
>>  - scale time to wait by the number of remaining cpus to respond.
>>    It helps to find something wrong earlier and thus we can reboot the
>>    system earlier.
>> ---
>>  xen/arch/x86/microcode.c | 171 ++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 155 insertions(+), 16 deletions(-)
>> 
>> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
>> index 23cf550..f4a417e 100644
>> --- a/xen/arch/x86/microcode.c
>> +++ b/xen/arch/x86/microcode.c
>> @@ -22,6 +22,7 @@
>>   */
>>  
>>  #include <xen/cpu.h>
>> +#include <xen/cpumask.h>
>
>It seems vanishingly unlikely that you would need this explicit #include
>here, but it certainly isn't wrong.
>
>> @@ -270,31 +296,90 @@ bool microcode_update_cache(struct microcode_patch *patch)
>>      return true;
>>  }
>>  
>> -static long do_microcode_update(void *patch)
>> +/* Wait for CPUs to rendezvous with a timeout (us) */
>> +static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
>> +                         unsigned int timeout)
>>  {
>> -    int error, cpu;
>> -
>> -    error = microcode_update_cpu(patch);
>> -    if ( error )
>> +    while ( atomic_read(cnt) < expect )
>>      {
>> -        microcode_ops->free_patch(microcode_cache);
>> -        return error;
>> +        if ( !timeout )
>> +        {
>> +            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
>> +                   smp_processor_id());
>> +            return -EBUSY;
>> +        }
>> +        udelay(1);
>> +        timeout--;
>>      }
>
>There's no comment here and nothing in the description: I don't
>recall clarification as to whether RDTSC is fine to be issued by a
>thread when ucode is being updated by another thread on the
>same core.

Yes. I think it is fine.

Ashok, could you share your opinion on this question?

>
>> +static int do_microcode_update(void *patch)
>> +{
>> +    unsigned int cpu = smp_processor_id();
>> +    unsigned int cpu_nr = num_online_cpus();
>> +    unsigned int finished;
>> +    int ret;
>> +    static bool error;
>>  
>> -    microcode_update_cache(patch);
>> +    atomic_inc(&cpu_in);
>> +    ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
>> +    if ( ret )
>> +        return ret;
>>  
>> -    return error;
>> +    ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>> +    /*
>> +     * Load microcode update on only one logical processor per core.
>> +     * Here, among logical processors of a core, the one with the
>> +     * lowest thread id is chosen to perform the loading.
>> +     */
>> +    if ( !ret && (cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu))) )
>
>At the very least it's not obvious whether this hyper-threading-centric
>view ("logical processor") also applies to AMD's compute unit model
>(which reuses cpu_sibling_mask). It does, as the respective MSRs are
>per-compute-unit rather than per-core, but I'd appreciate if the
>wording could be adjusted to explicitly name both cases (multiple
>threads per core and multiple cores per CU).

OK. Will do

>
>> +    {
>> +        ret = microcode_ops->apply_microcode(patch);
>> +        if ( !ret )
>> +            atomic_inc(&cpu_updated);
>> +    }
>> +    /*
>> +     * Increase the wait timeout to a safe value here since we're serializing
>
>I'm struggling with the "increase": I don't see anything being increased
>here. You simply use a larger timeout than above.
>
>> +     * the microcode update and that could take a while on a large number of
>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>> +     * the last CPU finished updating and thus cut short
>> +     */
>> +    atomic_inc(&cpu_out);
>> +    finished = atomic_read(&cpu_out);
>> +    while ( !error && finished != cpu_nr )
>> +    {
>> +        /*
>> +         * During each timeout interval, at least a CPU is expected to
>> +         * finish its update. Otherwise, something goes wrong.
>> +         */
>> +        if ( wait_for_cpus(&cpu_out, finished + 1,
>> +                           MICROCODE_UPDATE_TIMEOUT_US) && !error )
>> +        {
>> +            error = true;
>> +            panic("Timeout when finishing updating microcode (finished %d/%d)",
>> +                  finished, cpu_nr);
>
>Why the setting of "error" when you panic anyway?
>
>And please use format specifiers matching the types of the
>further arguments (i.e. twice %u here, but please check other
>code as well).
>
>Furthermore (and I'm sure I've given this comment before) if
>you really hit the limit, how many panic() invocations are there
>going to be? You run this function on all CPUs after all.

"error" is to avoid calling of panic() on multiple CPUs simultaneously.
Roger is right: atomic primitives should be used here.

>
>On the whole, taking a 256-thread system as example, you
>allow the whole process to take over 4 min without calling
>panic().
>Leaving aside guests, I don't think Xen itself would
>survive this in all cases. We've found the need to process
>softirqs with far smaller delays, in particular from key handlers
>producing lots of output. At the very least there should be a
>bold warning logged if the system had been in stop-machine
>state for, say, longer than 100ms (value subject to discussion).
>

In theory, if you mean 256 cores, yes. Do you think a configurable and
run-time changeable upper bound for the whole process can address your
concern? The default value for this upper bound can be set to a large
value (for example, 1s * the number of online cores) and the admin can
adjust/lower the upper bound according to the way (serial or parallel) the
update is performed and other requirements. Once the upper bound is
reached, we would call panic().

>> +        }
>> +
>> +        finished = atomic_read(&cpu_out);
>> +    }
>> +
>> +    /*
>> +     * Refresh CPU signature (revision) on threads which didn't call
>> +     * apply_microcode().
>> +     */
>> +    if ( cpu != cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>> +        ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>
>Another option would be for the CPU doing the update to simply
>propagate the new value to all its siblings' cpu_sig values.

Will do.

>
>> @@ -337,12 +429,59 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>>          if ( patch )
>>              microcode_ops->free_patch(patch);
>>          ret = -EINVAL;
>> -        goto free;
>> +        goto put;
>>      }
>>  
>> -    ret = continue_hypercall_on_cpu(cpumask_first(&cpu_online_map),
>> -                                    do_microcode_update, patch);
>> +    atomic_set(&cpu_in, 0);
>> +    atomic_set(&cpu_out, 0);
>> +    atomic_set(&cpu_updated, 0);
>> +
>> +    /* Calculate the number of online CPU core */
>> +    nr_cores = 0;
>> +    for_each_online_cpu(cpu)
>> +        if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>> +            nr_cores++;
>> +
>> +    printk(XENLOG_INFO "%d cores are to update their microcode\n", nr_cores);
>> +
>> +    /*
>> +     * We intend to disable interrupt for long time, which may lead to
>> +     * watchdog timeout.
>> +     */
>> +    watchdog_disable();
>> +    /*
>> +     * Late loading dance. Why the heavy-handed stop_machine effort?
>> +     *
>> +     * - HT siblings must be idle and not execute other code while the other
>> +     *   sibling is loading microcode in order to avoid any negative
>> +     *   interactions cause by the loading.
>> +     *
>> +     * - In addition, microcode update on the cores must be serialized until
>> +     *   this requirement can be relaxed in the future. Right now, this is
>> +     *   conservative and good.
>> +     */
>> +    ret = stop_machine_run(do_microcode_update, patch, NR_CPUS);
>> +    watchdog_enable();
>> +
>> +    if ( atomic_read(&cpu_updated) == nr_cores )
>> +    {
>> +        spin_lock(&microcode_mutex);
>> +        microcode_update_cache(patch);
>> +        spin_unlock(&microcode_mutex);
>> +    }
>> +    else if ( atomic_read(&cpu_updated) == 0 )
>> +        microcode_ops->free_patch(patch);
>> +    else
>> +    {
>> +        printk("Updating microcode succeeded on part of CPUs and failed on\n"
>> +               "others due to an unknown reason. A system with different\n"
>> +               "microcode revisions is considered unstable. Please reboot and\n"
>> +               "do not load the microcode that triggers this warning\n");
>> +        microcode_ops->free_patch(patch);
>> +    }
>
>As said on an earlier patch, I think the cache can be updated if at
>least one CPU loaded the blob successfully. Additionally I'd like to
>ask that you log the number of successfully updated cores. And
>finally perhaps "differing" instead of "different" and omit "due to
>an unknown reason"?

Will do.

Thanks
Chao
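
For reference, the outcome handling requested in the review above could be sketched roughly as follows. This is a standalone illustration, not code from the patch: classify_update(), should_cache_patch() and format_update_report() are hypothetical names, and the message text is only one possible wording that follows the suggestions (report the number of updated cores, say "differing", drop "due to an unknown reason").

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names; none of these exist in xen/arch/x86/microcode.c. */
enum ucode_outcome {
    UCODE_NONE_UPDATED,
    UCODE_PARTIALLY_UPDATED,
    UCODE_ALL_UPDATED,
};

/* Classify the result from the number of cores that loaded the blob. */
static enum ucode_outcome classify_update(unsigned int updated,
                                          unsigned int nr_cores)
{
    if ( updated == 0 )
        return UCODE_NONE_UPDATED;

    return updated == nr_cores ? UCODE_ALL_UPDATED : UCODE_PARTIALLY_UPDATED;
}

/* Review suggestion: keep the patch in the cache if any core loaded it. */
static bool should_cache_patch(unsigned int updated)
{
    return updated > 0;
}

/* Format the log line into buf; returns whatever snprintf() returns. */
static int format_update_report(char *buf, size_t len, unsigned int updated,
                                unsigned int nr_cores)
{
    if ( classify_update(updated, nr_cores) == UCODE_PARTIALLY_UPDATED )
        return snprintf(buf, len,
                        "Microcode updated on %u of %u cores. A system with "
                        "differing microcode revisions is considered "
                        "unstable.\n", updated, nr_cores);

    return snprintf(buf, len, "Microcode updated on %u of %u cores\n",
                    updated, nr_cores);
}
```

In the hypervisor the report would go through printk() and the cache decision would replace the "== nr_cores" check; both are left out here so that the sketch stays self-contained.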
Jan Beulich June 11, 2019, 12:58 p.m. UTC | #4
>>> On 11.06.19 at 14:36, <chao.gao@intel.com> wrote:
> On Wed, Jun 05, 2019 at 08:09:43AM -0600, Jan Beulich wrote:
>>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>>On the whole, taking a 256-thread system as example, you
>>allow the whole process to take over 4 min without calling
>>panic().
>>Leaving aside guests, I don't think Xen itself would
>>survive this in all cases. We've found the need to process
>>softirqs with far smaller delays, in particular from key handlers
>>producing lots of output. At the very least there should be a
>>bold warning logged if the system had been in stop-machine
>>state for, say, longer than 100ms (value subject to discussion).
> 
> In theory, if you mean 256 cores, yes. Do you think a configurable and
> run-time changeable upper bound for the whole process can address your
> concern? The default value for this upper bound can be set to a large
> value (for example, 1s * the number of online cores) and the admin can
> adjust/lower the upper bound according to the way (serial or parallel) to
> perform the update and other requirements. Once the upper bound is
> reached, we would call panic().

Well, a command line option to control the total time until
calling panic() may help, but as you've said in the past: If we
panic anyway, it doesn't matter much what the timeout is. My
point was rather to make explicit that the process may have
completed after a (too) long time. Remember you mean this
late loading to happen with guests running. We should avoid
making the system unstable as much as we can. This includes
this taking long and then completing successfully _as well as_
calling panic().

Jan
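
To make the "over 4 min" figure concrete, here is a rough back-of-the-envelope sketch. It assumes the two timeouts from the patch (30ms for the rendezvous, 1s per wait interval) and that every interval but the last is used up almost entirely. worst_case_wait_us() and stop_machine_too_long() are hypothetical helpers, and the 100ms threshold is only the value floated above as "subject to discussion". Time spent in apply_microcode() itself and the per-iteration overhead around udelay(1) are ignored, so the real figure can only be larger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch. */
#define MICROCODE_CALLIN_TIMEOUT_US 30000u
#define MICROCODE_UPDATE_TIMEOUT_US 1000000u

/* Hypothetical warning threshold: 100ms, as floated in the review. */
#define STOP_MACHINE_WARN_US 100000u

/*
 * Rough upper bound on the time spent in the wait loops without panic()
 * firing: one rendezvous timeout, plus one update timeout for each of
 * the remaining (nr_cpus - 1) CPUs a waiter may still be waiting for.
 */
static uint64_t worst_case_wait_us(unsigned int nr_cpus)
{
    if ( nr_cpus == 0 )
        return 0;

    return MICROCODE_CALLIN_TIMEOUT_US +
           (uint64_t)(nr_cpus - 1) * MICROCODE_UPDATE_TIMEOUT_US;
}

/* Would the suggested "bold warning" be due for this elapsed time? */
static bool stop_machine_too_long(uint64_t elapsed_us)
{
    return elapsed_us > STOP_MACHINE_WARN_US;
}
```

For 256 threads this gives 30000 + 255 * 1000000 = 255030000us, i.e. about 255s or roughly 4.25 minutes, far above a 100ms warning threshold.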
Ashok Raj June 11, 2019, 3:47 p.m. UTC | #5
Hi Gao, Jan

On Tue, Jun 11, 2019 at 08:36:17PM +0800, Chao Gao wrote:
> On Wed, Jun 05, 2019 at 08:09:43AM -0600, Jan Beulich wrote:
> >
> >There's no comment here and nothing in the description: I don't
> >recall clarification as to whether RDTSC is fine to be issued by a
> >thread when ucode is being updated by another thread on the
> >same core.
> 
> Yes. I think it is fine.
> 
> Ashok, could you share your opinion on this question?
> 

Yes, rdtsc should be fine for other threads to execute while waiting for the 
microcode update to complete on others. We do the same in Linux as well.

Cheers,
Ashok

Patch

diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
index 23cf550..f4a417e 100644
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -22,6 +22,7 @@ 
  */
 
 #include <xen/cpu.h>
+#include <xen/cpumask.h>
 #include <xen/lib.h>
 #include <xen/kernel.h>
 #include <xen/init.h>
@@ -30,15 +31,34 @@ 
 #include <xen/smp.h>
 #include <xen/softirq.h>
 #include <xen/spinlock.h>
+#include <xen/stop_machine.h>
 #include <xen/tasklet.h>
 #include <xen/guest_access.h>
 #include <xen/earlycpio.h>
+#include <xen/watchdog.h>
 
+#include <asm/delay.h>
 #include <asm/msr.h>
 #include <asm/processor.h>
 #include <asm/setup.h>
 #include <asm/microcode.h>
 
+/*
+ * Before performing a late microcode update on any thread, we
+ * rendezvous all cpus in stop_machine context. The timeout for
+ * waiting for cpu rendezvous is 30ms. It is the timeout used by
+ * live patching
+ */
+#define MICROCODE_CALLIN_TIMEOUT_US 30000
+
+/*
+ * Timeout for each thread to complete update is set to 1s. It is a
+ * conservative choice considering all possible interference (for
+ * instance, sometimes wbinvd takes a relatively long time). And a perfect
+ * timeout doesn't help a lot except an early shutdown.
+ */
+#define MICROCODE_UPDATE_TIMEOUT_US 1000000
+
 static module_t __initdata ucode_mod;
 static signed int __initdata ucode_mod_idx;
 static bool_t __initdata ucode_mod_forced;
@@ -190,6 +210,12 @@  static DEFINE_SPINLOCK(microcode_mutex);
 DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
 
 /*
+ * Count the CPUs that have entered, exited the rendezvous and succeeded in
+ * microcode update during late microcode update respectively.
+ */
+static atomic_t cpu_in, cpu_out, cpu_updated;
+
+/*
  * Return the patch with the highest revision id among all matching
  * patches in the blob. Return NULL if no suitable patch.
  */
@@ -270,31 +296,90 @@  bool microcode_update_cache(struct microcode_patch *patch)
     return true;
 }
 
-static long do_microcode_update(void *patch)
+/* Wait for CPUs to rendezvous with a timeout (us) */
+static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
+                         unsigned int timeout)
 {
-    int error, cpu;
-
-    error = microcode_update_cpu(patch);
-    if ( error )
+    while ( atomic_read(cnt) < expect )
     {
-        microcode_ops->free_patch(microcode_cache);
-        return error;
+        if ( !timeout )
+        {
+            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
+                   smp_processor_id());
+            return -EBUSY;
+        }
+        udelay(1);
+        timeout--;
     }
 
+    return 0;
+}
 
-    cpu = cpumask_next(smp_processor_id(), &cpu_online_map);
-    if ( cpu < nr_cpu_ids )
-        return continue_hypercall_on_cpu(cpu, do_microcode_update, patch);
+static int do_microcode_update(void *patch)
+{
+    unsigned int cpu = smp_processor_id();
+    unsigned int cpu_nr = num_online_cpus();
+    unsigned int finished;
+    int ret;
+    static bool error;
 
-    microcode_update_cache(patch);
+    atomic_inc(&cpu_in);
+    ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
+    if ( ret )
+        return ret;
 
-    return error;
+    ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
+    /*
+     * Load microcode update on only one logical processor per core.
+     * Here, among logical processors of a core, the one with the
+     * lowest thread id is chosen to perform the loading.
+     */
+    if ( !ret && (cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu))) )
+    {
+        ret = microcode_ops->apply_microcode(patch);
+        if ( !ret )
+            atomic_inc(&cpu_updated);
+    }
+    /*
+     * Increase the wait timeout to a safe value here since we're serializing
+     * the microcode update and that could take a while on a large number of
+     * CPUs. And that is fine as the *actual* timeout will be determined by
+     * the last CPU finished updating and thus cut short
+     */
+    atomic_inc(&cpu_out);
+    finished = atomic_read(&cpu_out);
+    while ( !error && finished != cpu_nr )
+    {
+        /*
+         * During each timeout interval, at least a CPU is expected to
+         * finish its update. Otherwise, something goes wrong.
+         */
+        if ( wait_for_cpus(&cpu_out, finished + 1,
+                           MICROCODE_UPDATE_TIMEOUT_US) && !error )
+        {
+            error = true;
+            panic("Timeout when finishing updating microcode (finished %d/%d)",
+                  finished, cpu_nr);
+        }
+
+        finished = atomic_read(&cpu_out);
+    }
+
+    /*
+     * Refresh CPU signature (revision) on threads which didn't call
+     * apply_microcode().
+     */
+    if ( cpu != cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
+        ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
+
+    return ret;
 }
 
 int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
 {
     int ret;
     void *buffer;
+    unsigned int cpu, nr_cores;
     struct microcode_patch *patch;
 
     if ( len != (uint32_t)len )
@@ -316,11 +401,18 @@  int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
         goto free;
     }
 
+    /* cpu_online_map must not change during update */
+    if ( !get_cpu_maps() )
+    {
+        ret = -EBUSY;
+        goto free;
+    }
+
     if ( microcode_ops->start_update )
     {
         ret = microcode_ops->start_update();
         if ( ret != 0 )
-            goto free;
+            goto put;
     }
 
     patch = microcode_parse_blob(buffer, len);
@@ -337,12 +429,59 @@  int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
         if ( patch )
             microcode_ops->free_patch(patch);
         ret = -EINVAL;
-        goto free;
+        goto put;
     }
 
-    ret = continue_hypercall_on_cpu(cpumask_first(&cpu_online_map),
-                                    do_microcode_update, patch);
+    atomic_set(&cpu_in, 0);
+    atomic_set(&cpu_out, 0);
+    atomic_set(&cpu_updated, 0);
+
+    /* Calculate the number of online CPU cores */
+    nr_cores = 0;
+    for_each_online_cpu(cpu)
+        if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
+            nr_cores++;
+
+    printk(XENLOG_INFO "%d cores are to update their microcode\n", nr_cores);
+
+    /*
+     * We intend to disable interrupt for long time, which may lead to
+     * watchdog timeout.
+     */
+    watchdog_disable();
+    /*
+     * Late loading dance. Why the heavy-handed stop_machine effort?
+     *
+     * - HT siblings must be idle and not execute other code while the other
+     *   sibling is loading microcode in order to avoid any negative
+     *   interactions caused by the loading.
+     *
+     * - In addition, microcode update on the cores must be serialized until
+     *   this requirement can be relaxed in the future. Right now, this is
+     *   conservative and good.
+     */
+    ret = stop_machine_run(do_microcode_update, patch, NR_CPUS);
+    watchdog_enable();
+
+    if ( atomic_read(&cpu_updated) == nr_cores )
+    {
+        spin_lock(&microcode_mutex);
+        microcode_update_cache(patch);
+        spin_unlock(&microcode_mutex);
+    }
+    else if ( atomic_read(&cpu_updated) == 0 )
+        microcode_ops->free_patch(patch);
+    else
+    {
+        printk("Updating microcode succeeded on part of CPUs and failed on\n"
+               "others due to an unknown reason. A system with different\n"
+               "microcode revisions is considered unstable. Please reboot and\n"
+               "do not load the microcode that triggers this warning\n");
+        microcode_ops->free_patch(patch);
+    }
 
+ put:
+    put_cpu_maps();
  free:
     xfree(buffer);
     return ret;
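
The two building blocks of the patch above, the counter-based rendezvous with a timeout and the "one thread per core" count, can be tried out in user space with a sketch like the following. It is only an approximation under stated assumptions: C11 atomics stand in for Xen's atomic_t, the timeout counts loop iterations because there is no udelay() here, and the first_sibling[] array is a hypothetical stand-in for cpumask_first(per_cpu(cpu_sibling_mask, cpu)).

```c
#include <errno.h>
#include <stdatomic.h>

/*
 * Spin until *cnt reaches expect or the budget runs out. The budget is
 * in loop iterations here; the patch calls udelay(1) per iteration, so
 * there it is (at least) microseconds.
 */
static int wait_for_count(atomic_uint *cnt, unsigned int expect,
                          unsigned int timeout)
{
    while ( atomic_load(cnt) < expect )
    {
        if ( !timeout )
            return -EBUSY;
        timeout--;
    }

    return 0;
}

/*
 * Count cores: a CPU represents its core if it is the lowest-numbered
 * thread among its siblings. first_sibling[cpu] is a hypothetical
 * precomputed table holding that lowest sibling for each CPU.
 */
static unsigned int count_cores(const unsigned int *first_sibling,
                                unsigned int nr_cpus)
{
    unsigned int cpu, nr_cores = 0;

    for ( cpu = 0; cpu < nr_cpus; cpu++ )
        if ( first_sibling[cpu] == cpu )
            nr_cores++;

    return nr_cores;
}
```

With first_sibling = { 0, 0, 2, 2, 4, 4 }, count_cores() reports 3 cores for 6 threads, which is the value the patch would compare cpu_updated against.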