
[v7,09/10] microcode: remove microcode_update_lock

Message ID 1558945891-3015-10-git-send-email-chao.gao@intel.com (mailing list archive)
State Superseded
Series improve late microcode loading

Commit Message

Chao Gao May 27, 2019, 8:31 a.m. UTC
microcode_update_lock prevents logical threads of the same core from
updating microcode at the same time. But, being a global lock, it also
prevents parallel microcode updates on different cores.

Remove this lock in order to update microcode in parallel. It is safe
because we have already ensured serialization of sibling threads at the
caller side.
1. For late microcode updates, do_microcode_update() ensures that only one
   sibling thread of a core can update microcode.
2. For microcode updates during system startup or CPU hotplug,
   microcode_mutex() guarantees update serialization of logical threads.
3. get/put_cpu_bitmaps() prevents concurrency between CPU hotplug and
   late microcode updates.

Note that the printk()s in apply_microcode() and svm_host_osvw_init() (AMD
only) are still processed sequentially.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
Changes in v7:
 - Reworked. Removed the complex locking logic introduced in v5 and v6. The
 microcode patch to be applied is passed as an argument without any global
 variable, so no lock is added to serialize potential readers/writers.
 Callers of apply_microcode() guarantee correctness: the patch pointed to by
 the argument won't be changed by others.

Changes in v6:
 - introduce early_ucode_update_lock to serialize early ucode update.

Changes in v5:
 - newly add
---
 xen/arch/x86/microcode_amd.c   | 8 +-------
 xen/arch/x86/microcode_intel.c | 8 +-------
 2 files changed, 2 insertions(+), 14 deletions(-)

Comments

Roger Pau Monné June 5, 2019, 2:52 p.m. UTC | #1
On Mon, May 27, 2019 at 04:31:30PM +0800, Chao Gao wrote:
> microcode_update_lock is to prevent logic threads of a same core from
> updating microcode at the same time. But due to using a global lock, it
> also prevented parallel microcode updating on different cores.

Oh, OK, so that's what I was missing from patch 8 and what serializes
the updating.

> Remove this lock in order to update microcode in parallel. It is safe
> because we have already ensured serialization of sibling threads at the
> caller side.

Then you certainly need to fix the wait loop in do_microcode_update to
only wait for MICROCODE_UPDATE_TIMEOUT_US regardless of the number of
CPUs in the system?

> 1.For late microcode update, do_microcode_update() ensures that only one
>   sibiling thread of a core can update microcode.
> 2.For microcode update during system startup or CPU-hotplug,
>   microcode_mutex() guarantees update serialization of logical threads.
> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>   late microcode update.
> 
> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
> only) are still processed sequentially.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Patch LGTM, but it needs to fix the wait loop in do_microcode_update.

Thanks, Roger.
Jan Beulich June 5, 2019, 2:53 p.m. UTC | #2
>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
> microcode_update_lock is to prevent logic threads of a same core from
> updating microcode at the same time. But due to using a global lock, it
> also prevented parallel microcode updating on different cores.
> 
> Remove this lock in order to update microcode in parallel. It is safe
> because we have already ensured serialization of sibling threads at the
> caller side.
> 1.For late microcode update, do_microcode_update() ensures that only one
>   sibiling thread of a core can update microcode.
> 2.For microcode update during system startup or CPU-hotplug,
>   microcode_mutex() guarantees update serialization of logical threads.
> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>   late microcode update.
> 
> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
> only) are still processed sequentially.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> ---
> Changes in v7:
>  - reworked. Remove complex lock logics introduced in v5 and v6. The microcode
>  patch to be applied is passed as an argument without any global variable. Thus
>  no lock is added to serialize potential readers/writers. Callers of
>  apply_microcode() will guarantee the correctness: the patch poninted by the
>  arguments won't be changed by others.

Much better this way indeed.

> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>  
>      mc_intel = patch->mc_intel;
>  
> -    /* serialize access to the physical write to MSR 0x79 */
> -    spin_lock_irqsave(&microcode_update_lock, flags);
> +    BUG_ON(local_irq_is_enabled());
>  
>      /*
>       * Writeback and invalidate caches before updating microcode to avoid

Thinking about it - what happens if we hit an NMI or #MC here?
watchdog_disable(), a call to which you add in an earlier patch,
doesn't really suppress the generation of NMIs, it only tells the
handler not to look at the accumulated statistics.

Jan
Jan Beulich June 5, 2019, 3:15 p.m. UTC | #3
>>> On 05.06.19 at 16:52, <roger.pau@citrix.com> wrote:
> On Mon, May 27, 2019 at 04:31:30PM +0800, Chao Gao wrote:
>> microcode_update_lock is to prevent logic threads of a same core from
>> updating microcode at the same time. But due to using a global lock, it
>> also prevented parallel microcode updating on different cores.
> 
> Oh, OK, so that's what I was missing from patch 8 and what serializes
> the updating.
> 
>> Remove this lock in order to update microcode in parallel. It is safe
>> because we have already ensured serialization of sibling threads at the
>> caller side.
> 
> Then you certainly need to fix the wait loop in do_microcode_update to
> only wait for MICROCODE_UPDATE_TIMEOUT_US regardless of the number of
> CPUs in the system?

Well, no, not exactly. On huge systems it may indeed still take longer
than on smaller ones. The way the waiting is coded now (expecting
forward progress in every MICROCODE_UPDATE_TIMEOUT_US period) is,
I think, acceptable. Plus leaving that logic alone will avoid touching it
yet again when introducing a serial application policy as an option.

Jan
Chao Gao June 11, 2019, 12:46 p.m. UTC | #4
On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>> microcode_update_lock is to prevent logic threads of a same core from
>> updating microcode at the same time. But due to using a global lock, it
>> also prevented parallel microcode updating on different cores.
>> 
>> Remove this lock in order to update microcode in parallel. It is safe
>> because we have already ensured serialization of sibling threads at the
>> caller side.
>> 1.For late microcode update, do_microcode_update() ensures that only one
>>   sibiling thread of a core can update microcode.
>> 2.For microcode update during system startup or CPU-hotplug,
>>   microcode_mutex() guarantees update serialization of logical threads.
>> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>>   late microcode update.
>> 
>> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
>> only) are still processed sequentially.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>
>Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> ---
>> Changes in v7:
>>  - reworked. Remove complex lock logics introduced in v5 and v6. The microcode
>>  patch to be applied is passed as an argument without any global variable. Thus
>>  no lock is added to serialize potential readers/writers. Callers of
>>  apply_microcode() will guarantee the correctness: the patch poninted by the
>>  arguments won't be changed by others.
>
>Much better this way indeed.
>
>> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>>  
>>      mc_intel = patch->mc_intel;
>>  
>> -    /* serialize access to the physical write to MSR 0x79 */
>> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> +    BUG_ON(local_irq_is_enabled());
>>  
>>      /*
>>       * Writeback and invalidate caches before updating microcode to avoid
>
>Thinking about it - what happens if we hit an NMI or #MC here?
>watchdog_disable(), a call to which you add in an earlier patch,
>doesn't really suppress the generation of NMIs, it only tells the
>handler not to look at the accumulated statistics.

I think they should be suppressed. Ashok, could you confirm it?

I will figure out how to suppress them in Xen.

Thanks
Chao
Jan Beulich June 11, 2019, 1:23 p.m. UTC | #5
>>> On 11.06.19 at 14:46, <chao.gao@intel.com> wrote:
> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>>> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>>>  
>>>      mc_intel = patch->mc_intel;
>>>  
>>> -    /* serialize access to the physical write to MSR 0x79 */
>>> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>> +    BUG_ON(local_irq_is_enabled());
>>>  
>>>      /*
>>>       * Writeback and invalidate caches before updating microcode to avoid
>>
>>Thinking about it - what happens if we hit an NMI or #MC here?
>>watchdog_disable(), a call to which you add in an earlier patch,
>>doesn't really suppress the generation of NMIs, it only tells the
>>handler not to look at the accumulated statistics.
> 
> I think they should be suppressed. Ashok, could you confirm it?
> 
> I will figure out how to suppress them in Xen.

Well, afaik suppressing #MC is impossible. CR4.MCE clear leads
to immediate shutdown in case of a machine check, unless I'm
mistaken. The watchdog NMI, otoh, could of course be "properly"
disabled.

Jan
Ashok Raj June 11, 2019, 4:04 p.m. UTC | #6
On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
> >
> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
> >>  
> >>      mc_intel = patch->mc_intel;
> >>  
> >> -    /* serialize access to the physical write to MSR 0x79 */
> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
> >> +    BUG_ON(local_irq_is_enabled());
> >>  
> >>      /*
> >>       * Writeback and invalidate caches before updating microcode to avoid
> >
> >Thinking about it - what happens if we hit an NMI or #MC here?
> >watchdog_disable(), a call to which you add in an earlier patch,
> >doesn't really suppress the generation of NMIs, it only tells the
> >handler not to look at the accumulated statistics.
> 
> I think they should be suppressed. Ashok, could you confirm it?

I think the only source would be the watchdog, as you pointed out,
which we already touch to keep it from expiring. I'm not an expert in
the perf counters, but I'll check. When we are in a stop_machine() type
flow, it's not clear whether any of those would fire. (I might be wrong,
but let me check.)

#MC shouldn't fire once you have entered the rendezvous; if it does, it's
usually fatal, like a 3-strike, so the machine is going down at that time
and there is nothing we could do to handle it.

Cheers,
Ashok
Jan Beulich June 12, 2019, 7:38 a.m. UTC | #7
>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>> >
>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> *patch)
>> >>  
>> >>      mc_intel = patch->mc_intel;
>> >>  
>> >> -    /* serialize access to the physical write to MSR 0x79 */
>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> >> +    BUG_ON(local_irq_is_enabled());
>> >>  
>> >>      /*
>> >>       * Writeback and invalidate caches before updating microcode to avoid
>> >
>> >Thinking about it - what happens if we hit an NMI or #MC here?
>> >watchdog_disable(), a call to which you add in an earlier patch,
>> >doesn't really suppress the generation of NMIs, it only tells the
>> >handler not to look at the accumulated statistics.
>> 
>> I think they should be suppressed. Ashok, could you confirm it?
> 
> I think the only sources would be the watchdog as you pointed out
> which we already touch to keep it from expiring. The perf counters
> i'm not an expert in, but i'll check. When we are in stop_machine() type
> flow, its not clear if any of those would fire. (I might be wrong, but let
> me check).

Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
how would it _not_ potentially fire?

> #MC shouldn't fire once you entered the rendezvous, if it does its usually
> fatal like a 3strike or something. Machine is going down at that time
> so nothing we could do to handle.

Right - as long as we assume that #MC would be fatal anyway,
there's no point in thinking about ways to suppress it. I guess
it is fatal (almost) always right now, but I don't think it ought to
be. It's just that no-one has the time and environment to make
it actually behave better.

Jan
Chao Gao June 13, 2019, 2:05 p.m. UTC | #8
On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>> >
>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>> *patch)
>>> >>  
>>> >>      mc_intel = patch->mc_intel;
>>> >>  
>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>> >> +    BUG_ON(local_irq_is_enabled());
>>> >>  
>>> >>      /*
>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>> >
>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>> >doesn't really suppress the generation of NMIs, it only tells the
>>> >handler not to look at the accumulated statistics.
>>> 
>>> I think they should be suppressed. Ashok, could you confirm it?
>> 
>> I think the only sources would be the watchdog as you pointed out
>> which we already touch to keep it from expiring. The perf counters
>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>> flow, its not clear if any of those would fire. (I might be wrong, but let
>> me check).
>
>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>how would it _not_ potentially fire?

We plan not to prevent the NMI from being fired. Instead, if one thread of
a core is updating microcode, the other threads of that core would wait in
the NMI handler until the update completes. Is this approach acceptable?

Thanks
Chao
Jan Beulich June 13, 2019, 2:08 p.m. UTC | #9
>>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
> On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>> >
>>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>>> *patch)
>>>> >>  
>>>> >>      mc_intel = patch->mc_intel;
>>>> >>  
>>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>>> >> +    BUG_ON(local_irq_is_enabled());
>>>> >>  
>>>> >>      /*
>>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>>> >
>>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>>> >doesn't really suppress the generation of NMIs, it only tells the
>>>> >handler not to look at the accumulated statistics.
>>>> 
>>>> I think they should be suppressed. Ashok, could you confirm it?
>>> 
>>> I think the only sources would be the watchdog as you pointed out
>>> which we already touch to keep it from expiring. The perf counters
>>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>>> flow, its not clear if any of those would fire. (I might be wrong, but let
>>> me check).
>>
>>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>>how would it _not_ potentially fire?
> 
> We plan not to prevent NMI being fired. Instead, if one thread of a core
> is updating microcode, other threads of this core would stop in the
> handler of NMI until the update completion. Is this approach acceptable?

Well, I have to return the question: It is you who knows what is or
is not acceptable while an ucode update is in progress. In particular
it obviously matters how much ucode is involved in the delivery of
an NMI (and in allowing the handler to get to the point where you'd
"stop" it).

If the approach you suggest is fine for the NMI case, I'd then wonder
if it couldn't also be used for the #MC one.

Jan
Chao Gao June 13, 2019, 2:58 p.m. UTC | #10
On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
>>>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
>> On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>>>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>>> >
>>>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>>>> *patch)
>>>>> >>  
>>>>> >>      mc_intel = patch->mc_intel;
>>>>> >>  
>>>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>>>> >> +    BUG_ON(local_irq_is_enabled());
>>>>> >>  
>>>>> >>      /*
>>>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>>>> >
>>>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>>>> >doesn't really suppress the generation of NMIs, it only tells the
>>>>> >handler not to look at the accumulated statistics.
>>>>> 
>>>>> I think they should be suppressed. Ashok, could you confirm it?
>>>> 
>>>> I think the only sources would be the watchdog as you pointed out
>>>> which we already touch to keep it from expiring. The perf counters
>>>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>>>> flow, its not clear if any of those would fire. (I might be wrong, but let
>>>> me check).
>>>
>>>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>>>how would it _not_ potentially fire?
>> 
>> We plan not to prevent NMI being fired. Instead, if one thread of a core
>> is updating microcode, other threads of this core would stop in the
>> handler of NMI until the update completion. Is this approach acceptable?
>
>Well, I have to return the question: It is you who knows what is or
>is not acceptable while an ucode update is in progress. In particular
>it obviously matters how much ucode is involved in the delivery of
>an NMI (and in allowing the handler to get to the point where you'd
>"stop" it).
>
>If the approach you suggest is fine for the NMI case,

Yes, it is fine. It is a suggestion from Ashok and what he is working on
in the Linux kernel. I just wanted to make sure you didn't oppose this
approach in Xen (considering that disarming the watchdog NMI might be an
alternative).

>I'd then wonder if it couldn't also be used for the #MC one.

I think there is not much practical value for #MC because we still need to
wait for all threads to call in. But as you and Ashok said, #MC is usually
fatal and the machine goes down anyway.

Thanks
Chao
Ashok Raj June 13, 2019, 5:47 p.m. UTC | #11
On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
> >>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
> > On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
> >>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
> >>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
> >>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
> >>>> >
> >>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> >>> *patch)
> >>>> >>  
> >>>> >>      mc_intel = patch->mc_intel;
> >>>> >>  
> >>>> >> -    /* serialize access to the physical write to MSR 0x79 */
> >>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
> >>>> >> +    BUG_ON(local_irq_is_enabled());
> >>>> >>  
> >>>> >>      /*
> >>>> >>       * Writeback and invalidate caches before updating microcode to avoid
> >>>> >
> >>>> >Thinking about it - what happens if we hit an NMI or #MC here?
> >>>> >watchdog_disable(), a call to which you add in an earlier patch,
> >>>> >doesn't really suppress the generation of NMIs, it only tells the
> >>>> >handler not to look at the accumulated statistics.
> >>>> 
> >>>> I think they should be suppressed. Ashok, could you confirm it?
> >>> 
> >>> I think the only sources would be the watchdog as you pointed out
> >>> which we already touch to keep it from expiring. The perf counters
> >>> i'm not an expert in, but i'll check. When we are in stop_machine() type
> >>> flow, its not clear if any of those would fire. (I might be wrong, but let
> >>> me check).
> >>
> >>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
> >>how would it _not_ potentially fire?
> > 
> > We plan not to prevent NMI being fired. Instead, if one thread of a core
> > is updating microcode, other threads of this core would stop in the
> > handler of NMI until the update completion. Is this approach acceptable?
> 
> Well, I have to return the question: It is you who knows what is or
> is not acceptable while an ucode update is in progress. In particular
> it obviously matters how much ucode is involved in the delivery of
> an NMI (and in allowing the handler to get to the point where you'd
> "stop" it).
> 
> If the approach you suggest is fine for the NMI case, I'd then wonder
> if it couldn't also be used for the #MC one.

Architecturally only one #MC can be active in the system. If a new #MC
condition happens when MCG_STATUS.MCIP is already set, that would cause a
spontaneous shutdown.

If another NMI arrives on the CPU doing the wrmsr, it will be pended in
the local APIC and delivered after the wrmsr returns. The wrmsr flow
can't be interrupted.
Jan Beulich June 14, 2019, 8:58 a.m. UTC | #12
>>> On 13.06.19 at 19:47, <ashok.raj@intel.com> wrote:
> On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
>> >>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
>> > On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>> >>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>> >>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>> >>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>> >>>> >
>> >>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> 
>> >>> *patch)
>> >>>> >>  
>> >>>> >>      mc_intel = patch->mc_intel;
>> >>>> >>  
>> >>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>> >>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> >>>> >> +    BUG_ON(local_irq_is_enabled());
>> >>>> >>  
>> >>>> >>      /*
>> >>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>> >>>> >
>> >>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>> >>>> >watchdog_disable(), a call to which you add in an earlier patch,
>> >>>> >doesn't really suppress the generation of NMIs, it only tells the
>> >>>> >handler not to look at the accumulated statistics.
>> >>>> 
>> >>>> I think they should be suppressed. Ashok, could you confirm it?
>> >>> 
>> >>> I think the only sources would be the watchdog as you pointed out
>> >>> which we already touch to keep it from expiring. The perf counters
>> >>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>> >>> flow, its not clear if any of those would fire. (I might be wrong, but let
>> >>> me check).
>> >>
>> >>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>> >>how would it _not_ potentially fire?
>> > 
>> > We plan not to prevent NMI being fired. Instead, if one thread of a core
>> > is updating microcode, other threads of this core would stop in the
>> > handler of NMI until the update completion. Is this approach acceptable?
>> 
>> Well, I have to return the question: It is you who knows what is or
>> is not acceptable while an ucode update is in progress. In particular
>> it obviously matters how much ucode is involved in the delivery of
>> an NMI (and in allowing the handler to get to the point where you'd
>> "stop" it).
>> 
>> If the approach you suggest is fine for the NMI case, I'd then wonder
>> if it couldn't also be used for the #MC one.
> 
> Architecturally only one #MC can be active in the system. If a new #MC 
> condition happens when MCG_STATUS.MCIP is already set, that would cause 
> spontaneous 
> shutdown.

That's understood.

> If another NMI arrives on the CPU doing the wrmsr, it will be pended
> in the lapic and delivered after the wrmsr returns. wrmsr flow
> can't be interrupted. 

Of course.

Neither part of your response is an argument against adding the same
"defense" to the #MC handler, though. While an #MC will likely be fatal
to the system anyway, we should try to avoid making things worse when
we can.

Jan

Patch

diff --git a/xen/arch/x86/microcode_amd.c b/xen/arch/x86/microcode_amd.c
index c819028..b64a58d 100644
--- a/xen/arch/x86/microcode_amd.c
+++ b/xen/arch/x86/microcode_amd.c
@@ -74,9 +74,6 @@  struct mpbhdr {
     uint8_t data[];
 };
 
-/* serialize access to the physical write */
-static DEFINE_SPINLOCK(microcode_update_lock);
-
 /* See comment in start_update() for cases when this routine fails */
 static int collect_cpu_info(struct cpu_signature *csig)
 {
@@ -251,7 +248,6 @@  static enum microcode_match_result compare_patch(
 
 static int apply_microcode(const struct microcode_patch *patch)
 {
-    unsigned long flags;
     uint32_t rev;
     int hw_err;
     unsigned int cpu = smp_processor_id();
@@ -263,15 +259,13 @@  static int apply_microcode(const struct microcode_patch *patch)
 
     hdr = patch->mc_amd->mpb;
 
-    spin_lock_irqsave(&microcode_update_lock, flags);
+    BUG_ON(local_irq_is_enabled());
 
     hw_err = wrmsr_safe(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
 
     /* get patch id after patching */
     rdmsrl(MSR_AMD_PATCHLEVEL, rev);
 
-    spin_unlock_irqrestore(&microcode_update_lock, flags);
-
     /*
      * Some processors leave the ucode blob mapping as UC after the update.
      * Flush the mapping to regain normal cacheability.
diff --git a/xen/arch/x86/microcode_intel.c b/xen/arch/x86/microcode_intel.c
index bfb48ce..94a1561 100644
--- a/xen/arch/x86/microcode_intel.c
+++ b/xen/arch/x86/microcode_intel.c
@@ -93,9 +93,6 @@  struct extended_sigtable {
 
 #define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE)
 
-/* serialize access to the physical write to MSR 0x79 */
-static DEFINE_SPINLOCK(microcode_update_lock);
-
 static int collect_cpu_info(struct cpu_signature *csig)
 {
     unsigned int cpu_num = smp_processor_id();
@@ -295,7 +292,6 @@  static struct microcode_patch *allow_microcode_patch(
 
 static int apply_microcode(const struct microcode_patch *patch)
 {
-    unsigned long flags;
     uint64_t msr_content;
     unsigned int val[2];
     unsigned int cpu_num = raw_smp_processor_id();
@@ -307,8 +303,7 @@  static int apply_microcode(const struct microcode_patch *patch)
 
     mc_intel = patch->mc_intel;
 
-    /* serialize access to the physical write to MSR 0x79 */
-    spin_lock_irqsave(&microcode_update_lock, flags);
+    BUG_ON(local_irq_is_enabled());
 
     /*
      * Writeback and invalidate caches before updating microcode to avoid
@@ -327,7 +322,6 @@  static int apply_microcode(const struct microcode_patch *patch)
     rdmsrl(MSR_IA32_UCODE_REV, msr_content);
     val[1] = (uint32_t)(msr_content >> 32);
 
-    spin_unlock_irqrestore(&microcode_update_lock, flags);
     if ( val[1] != mc_intel->hdr.rev )
     {
         printk(KERN_ERR "microcode: CPU%d update from revision "