diff mbox series

[v10,14/16] microcode: rendezvous CPUs in NMI handler and load ucode

Message ID 1568272949-1086-15-git-send-email-chao.gao@intel.com (mailing list archive)
State Superseded
Series improve late microcode loading

Commit Message

Chao Gao Sept. 12, 2019, 7:22 a.m. UTC
When one core is loading ucode, handling NMI on sibling threads or
on other cores in the system might be problematic. By rendezvousing
all CPUs in NMI handler, it prevents NMI acceptance during ucode
loading.

Basically, some work previously done in stop_machine context is
moved to NMI handler. Primary threads call in and load ucode in
NMI handler. Secondary threads wait for the completion of ucode
loading on all CPU cores. An option is introduced to disable this
behavior.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
---
Changes in v10:
 - rewrite based on Sergey's idea and patch
 - add Sergey's SOB.
 - add an option to disable ucode loading in NMI handler
 - don't send IPI NMI to the control thread to avoid unknown_nmi_error()
 in do_nmi().
 - add an assertion to make sure the cpu chosen to handle platform NMI
 won't send self NMI. Otherwise, there is a risk that we encounter
 unknown_nmi_error() and system crashes.

Changes in v9:
 - control threads send NMI to all other threads. Slave threads will
 stay in the NMI handling to prevent NMI acceptance during ucode
 loading. Note that self-nmi is invalid according to SDM.
 - s/rep_nop/cpu_relax
 - remove debug message in microcode_nmi_callback(). Printing a debug
 message would take a long time and the control thread may time out.
 - rebase and fix conflicts

Changes in v8:
 - new
---
 docs/misc/xen-command-line.pandoc | 10 +++++
 xen/arch/x86/microcode.c          | 95 ++++++++++++++++++++++++++++++++-------
 xen/arch/x86/traps.c              |  6 ++-
 xen/include/asm-x86/nmi.h         |  3 ++
 4 files changed, 96 insertions(+), 18 deletions(-)

Comments

Jan Beulich Sept. 13, 2019, 9:14 a.m. UTC | #1
On 12.09.2019 09:22, Chao Gao wrote:
> When one core is loading ucode, handling NMI on sibling threads or
> on other cores in the system might be problematic. By rendezvousing
> all CPUs in NMI handler, it prevents NMI acceptance during ucode
> loading.
> 
> Basically, some work previously done in stop_machine context is
> moved to NMI handler. Primary threads call in and load ucode in
> NMI handler. Secondary threads wait for the completion of ucode
> loading on all CPU cores. An option is introduced to disable this
> behavior.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>



> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2056,6 +2056,16 @@ microcode in the cpio name space must be:
>    - on Intel: kernel/x86/microcode/GenuineIntel.bin
>    - on AMD  : kernel/x86/microcode/AuthenticAMD.bin
>  
> +### ucode_loading_in_nmi (x86)
> +> `= <boolean>`
> +
> +> Default: `true`
> +
> +When one CPU is loading ucode, handling NMIs on sibling threads or threads on
> +other cores might cause problems. By default, all CPUs rendezvous in NMI handler
> +and load ucode. This option provides a way to disable it in case of some CPUs
> +don't allow ucode loading in NMI handler.

We already have "ucode=", why don't you extend it to allow "ucode=nmi"
and "ucode=no-nmi"? (In any event, please no underscores in new
command line options - use hyphens if necessary.)
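
A minimal sketch of what such an extension could look like, written as
standalone C rather than against Xen's real option-parsing helpers. The
function name parse_ucode_nmi() and the flag opt_ucode_nmi are hypothetical,
and the other forms that "ucode=" already accepts are not modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for opt_ucode_loading_in_nmi; default as in the patch. */
static bool opt_ucode_nmi = true;

/*
 * Parse a comma-separated "ucode=" value, recognising only the two proposed
 * tokens "nmi" and "no-nmi". Returns 0 on success, -1 on any other token.
 */
static int parse_ucode_nmi(const char *s)
{
    assert(s != NULL);

    while ( *s )
    {
        size_t len = strcspn(s, ",");

        if ( len == 3 && !strncmp(s, "nmi", 3) )
            opt_ucode_nmi = true;
        else if ( len == 6 && !strncmp(s, "no-nmi", 6) )
            opt_ucode_nmi = false;
        else
            return -1;

        s += len;
        if ( *s == ',' )
            s++;
    }

    return 0;
}
```

Folding the switch into the existing comma-separated value keeps all
microcode knobs under one option and avoids a new underscore-named parameter.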

> @@ -232,6 +237,7 @@ DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
>   */
>  static cpumask_t cpu_callin_map;
>  static atomic_t cpu_out, cpu_updated;
> +const struct microcode_patch *nmi_patch;

static

> @@ -354,6 +360,50 @@ static void set_state(unsigned int state)
>      smp_wmb();
>  }
>  
> +static int secondary_thread_work(void)
> +{
> +    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
> +
> +    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
> +}
> +
> +static int primary_thread_work(const struct microcode_patch *patch)

I think it would be nice if both functions carried "nmi" in their
names - how about {primary,secondary}_nmi_work()? Or wait - the
primary one gets used outside of NMI as well, so I'm fine with its
name. The secondary one, otoh, is NMI-specific and also its only
caller doesn't care about the return value, so I'd suggest making
it return void alongside adding some form of "nmi" to its name. Or,
perhaps even better, have secondary_thread_fn() call it, moving the
cpu_sig update here (and of course then there shouldn't be any
"nmi" added to its name).

> +static int microcode_nmi_callback(const struct cpu_user_regs *regs, int cpu)
> +{
> +    unsigned int primary = cpumask_first(this_cpu(cpu_sibling_mask));
> +    unsigned int controller = cpumask_first(&cpu_online_map);
> +
> +    /* System-generated NMI, will be ignored */
> +    if ( loading_state != LOADING_CALLIN )
> +        return 0;

I'm not happy at all to see NMIs being ignored. But by returning
zero, you do _not_ ignore it. Did you perhaps mean "will be ignored
here", in which case perhaps better "leave to main handler"? And
for the comment to extend to the other two conditions right below,
I think it would be better to combine them all into a single if().

Also, throughout the series, I think you want to consistently use
ACCESS_ONCE() for reads/writes from/to loading_state.
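
As a rough illustration of that pattern, here is a standalone sketch. The
macro is written out in the common volatile-cast form (an assumption about
its shape, relying on the GNU __typeof__ extension), the numeric order of the
states is assumed, and set_state()/wait_for_state() are simplified stand-ins:
the bounded spin count is hypothetical and the memory barrier the real
set_state() issues is left out.

```c
#include <assert.h>

/* Force exactly one access per use; relies on the GNU __typeof__ extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* State names are from the patch; the numeric order is an assumption. */
enum { LOADING_CALLIN = 1, LOADING_ENTER, LOADING_EXIT };

static unsigned int loading_state;

/* Simplified stand-in: the smp_wmb() of the real helper is omitted. */
static void set_state(unsigned int state)
{
    assert(state <= LOADING_EXIT);
    ACCESS_ONCE(loading_state) = state;
}

/* Poll up to max_spins times; returns 1 if the state was observed, else 0. */
static int wait_for_state(unsigned int state, unsigned int max_spins)
{
    while ( max_spins-- )
        if ( ACCESS_ONCE(loading_state) == state )
            return 1;

    return 0;
}
```

Without the volatile access the compiler would be free to hoist the load out
of the polling loop, which is exactly what a spin-wait cannot tolerate.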

> +    if ( cpu == controller || (!opt_ucode_loading_in_nmi && cpu == primary) )
> +        return 0;

Why not

    if ( cpu == controller || !opt_ucode_loading_in_nmi )
        return 0;

? (And then, there being just a single use each in this function, I
don't think there's a need for the two local variables.)

> @@ -361,10 +411,7 @@ static int secondary_thread_fn(void)
>      if ( !wait_for_state(LOADING_CALLIN) )
>          return -EBUSY;
>  
> -    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
> -
> -    if ( !wait_for_state(LOADING_EXIT) )
> -        return -EBUSY;
> +    self_nmi();

Losing the -EBUSY indication here isn't very nice. Perhaps this
should be conveyed via a per-CPU variable?
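
One way this could look, sketched as standalone C. The array nmi_result[]
stands in for a real per-CPU variable, NR_CPUS is an arbitrary value chosen
for the sketch, and the direct call to nmi_handler_work() takes the place of
self_nmi(); all three are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

#define NR_CPUS 8   /* hypothetical fixed CPU count for the sketch */

/* Hypothetical per-CPU slot standing in for a Xen per-CPU variable. */
static int nmi_result[NR_CPUS];

/* Stand-in for the work done in NMI context; timed_out models a failed wait. */
static void nmi_handler_work(unsigned int cpu, int timed_out)
{
    assert(cpu < NR_CPUS);
    nmi_result[cpu] = timed_out ? -EBUSY : 0;
}

/*
 * Stand-in for secondary_thread_fn(): the direct call replaces self_nmi(),
 * after which the caller picks up the status the handler left behind.
 */
static int secondary_thread_fn_sketch(unsigned int cpu, int timed_out)
{
    assert(cpu < NR_CPUS);

    nmi_result[cpu] = 0;
    nmi_handler_work(cpu, timed_out);

    return nmi_result[cpu];
}
```

Because each CPU only ever writes and reads its own slot, no locking is
needed to hand the status back from the handler to the interrupted context.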

> @@ -379,15 +426,10 @@ static int primary_thread_fn(const struct microcode_patch *patch)
>      if ( !wait_for_state(LOADING_CALLIN) )
>          return -EBUSY;
>  
> -    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
> -
> -    if ( !wait_for_state(LOADING_ENTER) )
> -        return -EBUSY;
> -
> -    ret = microcode_ops->apply_microcode(patch);
> -    if ( !ret )
> -        atomic_inc(&cpu_updated);
> -    atomic_inc(&cpu_out);
> +    if ( opt_ucode_loading_in_nmi )
> +        self_nmi();

Same here.

> @@ -404,6 +447,9 @@ static int control_thread_fn(const struct microcode_patch *patch)
>       */
>      watchdog_disable();
>  
> +    nmi_patch = patch;
> +    saved_nmi_callback = set_nmi_callback(microcode_nmi_callback);

Shouldn't there be smp_wmb() between these two?
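
The ordering concern can be sketched in portable C11, with a release fence
playing the role of the write barrier. This is a model of the idea, not Xen
code: the struct layout, nmi_cb_t, publish() and deliver_nmi() are all
hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct microcode_patch { int rev; };   /* placeholder layout, not Xen's */

typedef int nmi_cb_t(void);

static const struct microcode_patch *nmi_patch;
static _Atomic(nmi_cb_t *) nmi_callback;
static int seen_rev;

static int microcode_cb(void)
{
    /* Safe to dereference: the pointer was published before the callback. */
    seen_rev = nmi_patch->rev;
    return 0;
}

static void publish(const struct microcode_patch *patch, nmi_cb_t *cb)
{
    assert(patch != NULL);

    nmi_patch = patch;
    /* Plays the role of the write barrier: data store first, hook second. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&nmi_callback, cb, memory_order_relaxed);
}

/* Models NMI delivery: run the installed callback, or -1 if none is set. */
static int deliver_nmi(void)
{
    nmi_cb_t *cb = atomic_load_explicit(&nmi_callback, memory_order_acquire);

    return cb ? cb() : -1;
}
```

Without the barrier, a CPU taking an NMI right after the callback becomes
visible could, in principle, still observe the old value of nmi_patch.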

> @@ -458,6 +513,7 @@ static int control_thread_fn(const struct microcode_patch *patch)
>      /* Mark loading is done to unblock other threads */
>      set_state(LOADING_EXIT);
>  
> +    set_nmi_callback(saved_nmi_callback);

To be on the safe side, I think you also want to clear nmi_patch again.
Or maybe even better not clear it, but set it to a non-NULL value which,
when accessed, would trap (e.g. ZERO_BLOCK_PTR). This value should then
also be the variable's initializer.

> @@ -522,6 +578,13 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>          goto free;
>      }
>  
> +    /*
> +     * CPUs except the first online CPU would send a fake (self) NMI to
> +     * rendezvous in NMI handler. But a fake NMI to nmi_cpu may trigger
> +     * unknown_nmi_error(). It ensures nmi_cpu won't receive a fake NMI.
> +     */
> +    ASSERT( !cpu_online(nmi_cpu) || nmi_cpu == cpumask_first(&cpu_online_map) );

Please drop the blanks immediately inside the parentheses.

As to the left side of the || - is this really needed? It surely would
be very wrong (but entirely unrelated to ucode loading) if the CPU to
receive platform NMIs was offline.

> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -126,6 +126,8 @@ boolean_param("ler", opt_ler);
>  /* LastExceptionFromIP on this hardware.  Zero if LER is not in use. */
>  unsigned int __read_mostly ler_msr;
>  
> +unsigned int __read_mostly nmi_cpu;

Since this variable (for now) is never written to it should gain a
comment saying why this is, and perhaps it would then also better be
const rather than __read_mostly.

Jan
Jan Beulich Sept. 13, 2019, 9:18 a.m. UTC | #2
On 12.09.2019 09:22, Chao Gao wrote:
> @@ -419,14 +465,23 @@ static int control_thread_fn(const struct microcode_patch *patch)
>          return ret;
>      }
>  
> -    /* Let primary threads load the given ucode update */
> -    set_state(LOADING_ENTER);
> -
> +    /* Control thread loads ucode first while others are in NMI handler. */
>      ret = microcode_ops->apply_microcode(patch);
>      if ( !ret )
>          atomic_inc(&cpu_updated);
>      atomic_inc(&cpu_out);
>  
> +    if ( ret == -EIO )
> +    {
> +        printk(XENLOG_ERR
> +               "Late loading aborted: CPU%u failed to update ucode\n", cpu);
> +        set_state(LOADING_EXIT);
> +        return ret;
> +    }
> +
> +    /* Let primary threads load the given ucode update */
> +    set_state(LOADING_ENTER);

One more question - why this deferral of setting state to ENTER? If
it's needed, it should be explained in the description.

Jan
Chao Gao Sept. 16, 2019, 3:18 a.m. UTC | #3
On Fri, Sep 13, 2019 at 11:14:59AM +0200, Jan Beulich wrote:
>On 12.09.2019 09:22, Chao Gao wrote:
>> When one core is loading ucode, handling NMI on sibling threads or
>> on other cores in the system might be problematic. By rendezvousing
>> all CPUs in NMI handler, it prevents NMI acceptance during ucode
>> loading.
>> 
>> Basically, some work previously done in stop_machine context is
>> moved to NMI handler. Primary threads call in and load ucode in
>> NMI handler. Secondary threads wait for the completion of ucode
>> loading on all CPU cores. An option is introduced to disable this
>> behavior.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
>
>
>
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -2056,6 +2056,16 @@ microcode in the cpio name space must be:
>>    - on Intel: kernel/x86/microcode/GenuineIntel.bin
>>    - on AMD  : kernel/x86/microcode/AuthenticAMD.bin
>>  
>> +### ucode_loading_in_nmi (x86)
>> +> `= <boolean>`
>> +
>> +> Default: `true`
>> +
>> +When one CPU is loading ucode, handling NMIs on sibling threads or threads on
>> +other cores might cause problems. By default, all CPUs rendezvous in NMI handler
>> +and load ucode. This option provides a way to disable it in case of some CPUs
>> +don't allow ucode loading in NMI handler.
>
>We already have "ucode=", why don't you extend it to allow "ucode=nmi"
>and "ucode=no-nmi"? (In any event, please no underscores in new
>command line options - use hyphens if necessary.)

Ok. Will extend the "ucode" parameter.

>
>> @@ -232,6 +237,7 @@ DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
>>   */
>>  static cpumask_t cpu_callin_map;
>>  static atomic_t cpu_out, cpu_updated;
>> +const struct microcode_patch *nmi_patch;
>
>static
>
>> @@ -354,6 +360,50 @@ static void set_state(unsigned int state)
>>      smp_wmb();
>>  }
>>  
>> +static int secondary_thread_work(void)
>> +{
>> +    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>> +
>> +    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
>> +}
>> +
>> +static int primary_thread_work(const struct microcode_patch *patch)
>
>I think it would be nice if both functions carried "nmi" in their
>names - how about {primary,secondary}_nmi_work()? Or wait - the
>primary one gets used outside of NMI as well, so I'm fine with its
>name.
>The secondary one, otoh, is NMI-specific and also its only
>caller doesn't care about the return value, so I'd suggest making
>it return void alongside adding some form of "nmi" to its name. Or,

Will do.

>perhaps even better, have secondary_thread_fn() call it, moving the
>cpu_sig update here (and of course then there shouldn't be any
>"nmi" added to its name).

Even with "ucode=no-nmi", secondary threads have to busy-loop in the NMI
handler until primary threads complete the update. Otherwise, they may
access MSRs (like SPEC_CTRL), which is considered unsafe.

>
>> +static int microcode_nmi_callback(const struct cpu_user_regs *regs, int cpu)
>> +{
>> +    unsigned int primary = cpumask_first(this_cpu(cpu_sibling_mask));
>> +    unsigned int controller = cpumask_first(&cpu_online_map);
>> +
>> +    /* System-generated NMI, will be ignored */
>> +    if ( loading_state != LOADING_CALLIN )
>> +        return 0;
>
>I'm not happy at all to see NMIs being ignored. But by returning
>zero, you do _not_ ignore it. Did you perhaps mean "will be ignored
>here", in which case perhaps better "leave to main handler"? And
>for the comment to extend to the other two conditions right below,
>I think it would be better to combine them all into a single if().
>
>Also, throughout the series, I think you want to consistently use
>ACCESS_ONCE() for reads/writes from/to loading_state.
>
>> +    if ( cpu == controller || (!opt_ucode_loading_in_nmi && cpu == primary) )
>> +        return 0;
>
>Why not

As I said above, secondary threads are expected to stay in the NMI handler
regardless of the setting of opt_ucode_loading_in_nmi.

>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -126,6 +126,8 @@ boolean_param("ler", opt_ler);
>>  /* LastExceptionFromIP on this hardware.  Zero if LER is not in use. */
>>  unsigned int __read_mostly ler_msr;
>>  
>> +unsigned int __read_mostly nmi_cpu;
>
>Since this variable (for now) is never written to it should gain a
>comment saying why this is, and perhaps it would then also better be
>const rather than __read_mostly.

How about using the macro below:
#define NMI_CPU 0

Thanks
Chao
Jan Beulich Sept. 16, 2019, 8:22 a.m. UTC | #4
On 16.09.2019 05:18, Chao Gao wrote:
> On Fri, Sep 13, 2019 at 11:14:59AM +0200, Jan Beulich wrote:
>> On 12.09.2019 09:22, Chao Gao wrote:
>>> @@ -354,6 +360,50 @@ static void set_state(unsigned int state)
>>>      smp_wmb();
>>>  }
>>>  
>>> +static int secondary_thread_work(void)
>>> +{
>>> +    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>>> +
>>> +    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
>>> +}
>>> +
>>> +static int primary_thread_work(const struct microcode_patch *patch)
>>
>> I think it would be nice if both functions carried "nmi" in their
>> names - how about {primary,secondary}_nmi_work()? Or wait - the
>> primary one gets used outside of NMI as well, so I'm fine with its
>> name.
>> The secondary one, otoh, is NMI-specific and also its only
>> caller doesn't care about the return value, so I'd suggest making
>> it return void alongside adding some form of "nmi" to its name. Or,
> 
> Will do.
> 
>> perhaps even better, have secondary_thread_fn() call it, moving the
>> cpu_sig update here (and of course then there shouldn't be any
>> "nmi" added to its name).
> 
> Even with "ucode=no-nmi", secondary threads have to do busy-loop in
> NMI handling util primary threads completing the update. Otherwise,
> it may access MSRs (like SPEC_CTRL) which is considered unsafe.

Of course. Note that I said "call it"; I did not suggest replacing
secondary_thread_fn().

>>> +static int microcode_nmi_callback(const struct cpu_user_regs *regs, int cpu)
>>> +{
>>> +    unsigned int primary = cpumask_first(this_cpu(cpu_sibling_mask));
>>> +    unsigned int controller = cpumask_first(&cpu_online_map);
>>> +
>>> +    /* System-generated NMI, will be ignored */
>>> +    if ( loading_state != LOADING_CALLIN )
>>> +        return 0;
>>
>> I'm not happy at all to see NMIs being ignored. But by returning
>> zero, you do _not_ ignore it. Did you perhaps mean "will be ignored
>> here", in which case perhaps better "leave to main handler"? And
>> for the comment to extend to the other two conditions right below,
>> I think it would be better to combine them all into a single if().
>>
>> Also, throughout the series, I think you want to consistently use
>> ACCESS_ONCE() for reads/writes from/to loading_state.
>>
>>> +    if ( cpu == controller || (!opt_ucode_loading_in_nmi && cpu == primary) )
>>> +        return 0;
>>
>> Why not
> 
> As I said above, secondary threads are expected to stay in NMI handler
> regardless the setting of opt_ucode_loading_in_nmi.

Oh, here I see how your remark above matters. Please add code
comments then to make this clear to the reader.
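
For illustration, the decision could be written down with such comments as a
pure function, also folding the early exits into a single if(). The enum and
classify_nmi() are hypothetical names; the conditions themselves mirror the
ones in microcode_nmi_callback() from the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum nmi_action {
    NMI_LEAVE_TO_MAIN,   /* not ours: fall through to the normal handler */
    NMI_PRIMARY_LOAD,    /* call in and load ucode inside the NMI handler */
    NMI_SECONDARY_WAIT,  /* call in and spin until loading has finished */
};

static enum nmi_action classify_nmi(unsigned int cpu, unsigned int primary,
                                    unsigned int controller,
                                    bool in_callin, bool load_in_nmi)
{
    /* primary is the first sibling of cpu, so it can never exceed cpu. */
    assert(primary <= cpu);

    /*
     * Outside the call-in phase the NMI is a genuine one. The control
     * thread never sends itself an NMI, and with NMI loading disabled a
     * primary thread loads outside the handler instead.
     */
    if ( !in_callin || cpu == controller || (!load_in_nmi && cpu == primary) )
        return NMI_LEAVE_TO_MAIN;

    /*
     * Secondary threads get here regardless of load_in_nmi: they must stay
     * in the handler until their primary sibling has finished the update.
     */
    return cpu == primary ? NMI_PRIMARY_LOAD : NMI_SECONDARY_WAIT;
}
```

Keeping the classification separate from the work makes the
"secondaries always wait in NMI context" rule visible in one place.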

>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -126,6 +126,8 @@ boolean_param("ler", opt_ler);
>>>  /* LastExceptionFromIP on this hardware.  Zero if LER is not in use. */
>>>  unsigned int __read_mostly ler_msr;
>>>  
>>> +unsigned int __read_mostly nmi_cpu;
>>
>> Since this variable (for now) is never written to it should gain a
>> comment saying why this is, and perhaps it would then also better be
>> const rather than __read_mostly.
> 
> How about use the macro below:
> #define NMI_CPU 0

This is another option, yes. If there's any intention to ever allow
offlining CPU 0, then having the variable in place would seem better
to me. But I'll leave it to you at this point.

Jan

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 7c72e31..3017073 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2056,6 +2056,16 @@  microcode in the cpio name space must be:
   - on Intel: kernel/x86/microcode/GenuineIntel.bin
   - on AMD  : kernel/x86/microcode/AuthenticAMD.bin
 
+### ucode_loading_in_nmi (x86)
+> `= <boolean>`
+
+> Default: `true`
+
+When one CPU is loading ucode, handling NMIs on sibling threads or threads on
+other cores might cause problems. By default, all CPUs rendezvous in NMI handler
+and load ucode. This option provides a way to disable it in case of some CPUs
+don't allow ucode loading in NMI handler.
+
 ### unrestricted_guest (Intel)
 > `= <boolean>`
 
diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
index 049eda6..64a4321 100644
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -36,8 +36,10 @@ 
 #include <xen/earlycpio.h>
 #include <xen/watchdog.h>
 
+#include <asm/apic.h>
 #include <asm/delay.h>
 #include <asm/msr.h>
+#include <asm/nmi.h>
 #include <asm/processor.h>
 #include <asm/setup.h>
 #include <asm/microcode.h>
@@ -125,6 +127,9 @@  static int __init parse_ucode(const char *s)
 }
 custom_param("ucode", parse_ucode);
 
+static bool __read_mostly opt_ucode_loading_in_nmi = true;
+boolean_runtime_param("ucode_loading_in_nmi", opt_ucode_loading_in_nmi);
+
 /*
  * 8MB ought to be enough.
  */
@@ -232,6 +237,7 @@  DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
  */
 static cpumask_t cpu_callin_map;
 static atomic_t cpu_out, cpu_updated;
+const struct microcode_patch *nmi_patch;
 
 /*
  * Return a patch that covers current CPU. If there are multiple patches,
@@ -354,6 +360,50 @@  static void set_state(unsigned int state)
     smp_wmb();
 }
 
+static int secondary_thread_work(void)
+{
+    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
+
+    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
+}
+
+static int primary_thread_work(const struct microcode_patch *patch)
+{
+    int ret;
+
+    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
+
+    if ( !wait_for_state(LOADING_ENTER) )
+        return -EBUSY;
+
+    ret = microcode_ops->apply_microcode(patch);
+    if ( !ret )
+        atomic_inc(&cpu_updated);
+    atomic_inc(&cpu_out);
+
+    return ret;
+}
+
+static int microcode_nmi_callback(const struct cpu_user_regs *regs, int cpu)
+{
+    unsigned int primary = cpumask_first(this_cpu(cpu_sibling_mask));
+    unsigned int controller = cpumask_first(&cpu_online_map);
+
+    /* System-generated NMI, will be ignored */
+    if ( loading_state != LOADING_CALLIN )
+        return 0;
+
+    if ( cpu == controller || (!opt_ucode_loading_in_nmi && cpu == primary) )
+        return 0;
+
+    if ( cpu == primary )
+        primary_thread_work(nmi_patch);
+    else
+        secondary_thread_work();
+
+    return 0;
+}
+
 static int secondary_thread_fn(void)
 {
     unsigned int primary = cpumask_first(this_cpu(cpu_sibling_mask));
@@ -361,10 +411,7 @@  static int secondary_thread_fn(void)
     if ( !wait_for_state(LOADING_CALLIN) )
         return -EBUSY;
 
-    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
-
-    if ( !wait_for_state(LOADING_EXIT) )
-        return -EBUSY;
+    self_nmi();
 
     /* Copy update revision from the primary thread. */
     this_cpu(cpu_sig).rev = per_cpu(cpu_sig, primary).rev;
@@ -379,15 +426,10 @@  static int primary_thread_fn(const struct microcode_patch *patch)
     if ( !wait_for_state(LOADING_CALLIN) )
         return -EBUSY;
 
-    cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
-
-    if ( !wait_for_state(LOADING_ENTER) )
-        return -EBUSY;
-
-    ret = microcode_ops->apply_microcode(patch);
-    if ( !ret )
-        atomic_inc(&cpu_updated);
-    atomic_inc(&cpu_out);
+    if ( opt_ucode_loading_in_nmi )
+        self_nmi();
+    else
+        ret = primary_thread_work(patch);
 
     return ret;
 }
@@ -397,6 +439,7 @@  static int control_thread_fn(const struct microcode_patch *patch)
     unsigned int cpu = smp_processor_id(), done;
     unsigned long tick;
     int ret;
+    nmi_callback_t *saved_nmi_callback;
 
     /*
      * We intend to disable interrupt for long time, which may lead to
@@ -404,6 +447,9 @@  static int control_thread_fn(const struct microcode_patch *patch)
      */
     watchdog_disable();
 
+    nmi_patch = patch;
+    saved_nmi_callback = set_nmi_callback(microcode_nmi_callback);
+
     /* Allow threads to call in */
     set_state(LOADING_CALLIN);
 
@@ -419,14 +465,23 @@  static int control_thread_fn(const struct microcode_patch *patch)
         return ret;
     }
 
-    /* Let primary threads load the given ucode update */
-    set_state(LOADING_ENTER);
-
+    /* Control thread loads ucode first while others are in NMI handler. */
     ret = microcode_ops->apply_microcode(patch);
     if ( !ret )
         atomic_inc(&cpu_updated);
     atomic_inc(&cpu_out);
 
+    if ( ret == -EIO )
+    {
+        printk(XENLOG_ERR
+               "Late loading aborted: CPU%u failed to update ucode\n", cpu);
+        set_state(LOADING_EXIT);
+        return ret;
+    }
+
+    /* Let primary threads load the given ucode update */
+    set_state(LOADING_ENTER);
+
     tick = rdtsc_ordered();
     /* Wait for primary threads finishing update */
     done = atomic_read(&cpu_out);
@@ -458,6 +513,7 @@  static int control_thread_fn(const struct microcode_patch *patch)
     /* Mark loading is done to unblock other threads */
     set_state(LOADING_EXIT);
 
+    set_nmi_callback(saved_nmi_callback);
     watchdog_enable();
 
     return ret;
@@ -522,6 +578,13 @@  int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
         goto free;
     }
 
+    /*
+     * CPUs except the first online CPU would send a fake (self) NMI to
+     * rendezvous in NMI handler. But a fake NMI to nmi_cpu may trigger
+     * unknown_nmi_error(). It ensures nmi_cpu won't receive a fake NMI.
+     */
+    ASSERT( !cpu_online(nmi_cpu) || nmi_cpu == cpumask_first(&cpu_online_map) );
+
     patch = parse_blob(buffer, len);
     if ( IS_ERR(patch) )
     {
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 16c590d..503f5c8 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -126,6 +126,8 @@  boolean_param("ler", opt_ler);
 /* LastExceptionFromIP on this hardware.  Zero if LER is not in use. */
 unsigned int __read_mostly ler_msr;
 
+unsigned int __read_mostly nmi_cpu;
+
 #define stack_words_per_line 4
 #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)regs->rsp)
 
@@ -1679,7 +1681,7 @@  void do_nmi(const struct cpu_user_regs *regs)
      * this port before we re-arm the NMI watchdog, we reduce the chance
      * of having an NMI watchdog expire while in the SMI handler.
      */
-    if ( cpu == 0 )
+    if ( cpu == nmi_cpu )
         reason = inb(0x61);
 
     if ( (nmi_watchdog == NMI_NONE) ||
@@ -1687,7 +1689,7 @@  void do_nmi(const struct cpu_user_regs *regs)
         handle_unknown = true;
 
     /* Only the BSP gets external NMIs from the system. */
-    if ( cpu == 0 )
+    if ( cpu == nmi_cpu )
     {
         if ( reason & 0x80 )
             pci_serr_error(regs);
diff --git a/xen/include/asm-x86/nmi.h b/xen/include/asm-x86/nmi.h
index 99f6284..dbebffe 100644
--- a/xen/include/asm-x86/nmi.h
+++ b/xen/include/asm-x86/nmi.h
@@ -11,6 +11,9 @@  extern bool opt_watchdog;
 
 /* Watchdog force parameter from the command line */
 extern bool watchdog_force;
+
+/* CPU to handle platform NMI */
+extern unsigned int nmi_cpu;
  
 typedef int nmi_callback_t(const struct cpu_user_regs *regs, int cpu);