
[v7,09/10] microcode: remove microcode_update_lock

Message ID 1558945891-3015-10-git-send-email-chao.gao@intel.com (mailing list archive)
State Superseded
Series improve late microcode loading

Commit Message

Chao Gao May 27, 2019, 8:31 a.m. UTC
microcode_update_lock prevents logical threads of the same core from
updating microcode at the same time. But, being a global lock, it also
prevents parallel microcode updates on different cores.

Remove this lock in order to update microcode in parallel. It is safe
because we have already ensured serialization of sibling threads at the
caller side.
1. For late microcode updates, do_microcode_update() ensures that only one
   sibling thread of a core can update microcode.
2. For microcode updates during system startup or CPU hotplug,
   microcode_mutex() guarantees update serialization of logical threads.
3. get/put_cpu_bitmaps() prevents concurrency between CPU hotplug and
   late microcode updates.

Note that the printk()s in apply_microcode() and svm_host_osvw_init() (AMD
only) are still processed sequentially.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
Changes in v7:
 - Reworked. Removed the complex locking logic introduced in v5 and v6. The
 microcode patch to be applied is passed as an argument without any global
 variable, so no lock is added to serialize potential readers/writers.
 Callers of apply_microcode() guarantee correctness: the patch pointed to by
 the argument won't be changed by others.

Changes in v6:
 - introduce early_ucode_update_lock to serialize early ucode update.

Changes in v5:
 - newly add
---
 xen/arch/x86/microcode_amd.c   | 8 +-------
 xen/arch/x86/microcode_intel.c | 8 +-------
 2 files changed, 2 insertions(+), 14 deletions(-)

Comments

Roger Pau Monné June 5, 2019, 2:52 p.m. UTC | #1
On Mon, May 27, 2019 at 04:31:30PM +0800, Chao Gao wrote:
> microcode_update_lock is to prevent logic threads of a same core from
> updating microcode at the same time. But due to using a global lock, it
> also prevented parallel microcode updating on different cores.

Oh, OK, so that's what I was missing from patch 8 and what serializes
the updating.

> Remove this lock in order to update microcode in parallel. It is safe
> because we have already ensured serialization of sibling threads at the
> caller side.

Then you certainly need to fix the wait loop in do_microcode_update to
only wait for MICROCODE_UPDATE_TIMEOUT_US regardless of the number of
CPUs in the system?

> 1.For late microcode update, do_microcode_update() ensures that only one
>   sibiling thread of a core can update microcode.
> 2.For microcode update during system startup or CPU-hotplug,
>   microcode_mutex() guarantees update serialization of logical threads.
> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>   late microcode update.
> 
> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
> only) are still processed sequentially.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Patch LGTM, but it needs to fix the wait loop in do_microcode_update.

Thanks, Roger.
Jan Beulich June 5, 2019, 2:53 p.m. UTC | #2
>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
> microcode_update_lock is to prevent logic threads of a same core from
> updating microcode at the same time. But due to using a global lock, it
> also prevented parallel microcode updating on different cores.
> 
> Remove this lock in order to update microcode in parallel. It is safe
> because we have already ensured serialization of sibling threads at the
> caller side.
> 1.For late microcode update, do_microcode_update() ensures that only one
>   sibiling thread of a core can update microcode.
> 2.For microcode update during system startup or CPU-hotplug,
>   microcode_mutex() guarantees update serialization of logical threads.
> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>   late microcode update.
> 
> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
> only) are still processed sequentially.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> ---
> Changes in v7:
>  - reworked. Remove complex lock logics introduced in v5 and v6. The microcode
>  patch to be applied is passed as an argument without any global variable. Thus
>  no lock is added to serialize potential readers/writers. Callers of
>  apply_microcode() will guarantee the correctness: the patch poninted by the
>  arguments won't be changed by others.

Much better this way indeed.

> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>  
>      mc_intel = patch->mc_intel;
>  
> -    /* serialize access to the physical write to MSR 0x79 */
> -    spin_lock_irqsave(&microcode_update_lock, flags);
> +    BUG_ON(local_irq_is_enabled());
>  
>      /*
>       * Writeback and invalidate caches before updating microcode to avoid

Thinking about it - what happens if we hit an NMI or #MC here?
watchdog_disable(), a call to which you add in an earlier patch,
doesn't really suppress the generation of NMIs, it only tells the
handler not to look at the accumulated statistics.

Jan
Jan Beulich June 5, 2019, 3:15 p.m. UTC | #3
>>> On 05.06.19 at 16:52, <roger.pau@citrix.com> wrote:
> On Mon, May 27, 2019 at 04:31:30PM +0800, Chao Gao wrote:
>> microcode_update_lock is to prevent logic threads of a same core from
>> updating microcode at the same time. But due to using a global lock, it
>> also prevented parallel microcode updating on different cores.
> 
> Oh, OK, so that's what I was missing from patch 8 and what serializes
> the updating.
> 
>> Remove this lock in order to update microcode in parallel. It is safe
>> because we have already ensured serialization of sibling threads at the
>> caller side.
> 
> Then you certainly need to fix the wait loop in do_microcode_update to
> only wait for MICROCODE_UPDATE_TIMEOUT_US regardless of the number of
> CPUs in the system?

Well, no, not exactly. On huge systems it may indeed still take longer
than on smaller ones. The way the waiting is coded now (expecting
forward progress in every MICROCODE_UPDATE_TIMEOUT_US period) is,
I think, acceptable. Plus leaving that logic alone will avoid touching it
yet again when introducing a serial application policy as an option.

Jan
Chao Gao June 11, 2019, 12:46 p.m. UTC | #4
On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>> microcode_update_lock is to prevent logic threads of a same core from
>> updating microcode at the same time. But due to using a global lock, it
>> also prevented parallel microcode updating on different cores.
>> 
>> Remove this lock in order to update microcode in parallel. It is safe
>> because we have already ensured serialization of sibling threads at the
>> caller side.
>> 1.For late microcode update, do_microcode_update() ensures that only one
>>   sibiling thread of a core can update microcode.
>> 2.For microcode update during system startup or CPU-hotplug,
>>   microcode_mutex() guarantees update serialization of logical threads.
>> 3.get/put_cpu_bitmaps() prevents the concurrency of CPU-hotplug and
>>   late microcode update.
>> 
>> Note that printk in apply_microcode() and svm_host_osvm_init() (for AMD
>> only) are still processed sequentially.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>
>Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> ---
>> Changes in v7:
>>  - reworked. Remove complex lock logics introduced in v5 and v6. The microcode
>>  patch to be applied is passed as an argument without any global variable. Thus
>>  no lock is added to serialize potential readers/writers. Callers of
>>  apply_microcode() will guarantee the correctness: the patch poninted by the
>>  arguments won't be changed by others.
>
>Much better this way indeed.
>
>> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>>  
>>      mc_intel = patch->mc_intel;
>>  
>> -    /* serialize access to the physical write to MSR 0x79 */
>> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> +    BUG_ON(local_irq_is_enabled());
>>  
>>      /*
>>       * Writeback and invalidate caches before updating microcode to avoid
>
>Thinking about it - what happens if we hit an NMI or #MC here?
>watchdog_disable(), a call to which you add in an earlier patch,
>doesn't really suppress the generation of NMIs, it only tells the
>handler not to look at the accumulated statistics.

I think they should be suppressed. Ashok, could you confirm it?

I will figure out how to suppress them in Xen.

Thanks
Chao
Jan Beulich June 11, 2019, 1:23 p.m. UTC | #5
>>> On 11.06.19 at 14:46, <chao.gao@intel.com> wrote:
> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>>> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>>>  
>>>      mc_intel = patch->mc_intel;
>>>  
>>> -    /* serialize access to the physical write to MSR 0x79 */
>>> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>> +    BUG_ON(local_irq_is_enabled());
>>>  
>>>      /*
>>>       * Writeback and invalidate caches before updating microcode to avoid
>>
>>Thinking about it - what happens if we hit an NMI or #MC here?
>>watchdog_disable(), a call to which you add in an earlier patch,
>>doesn't really suppress the generation of NMIs, it only tells the
>>handler not to look at the accumulated statistics.
> 
> I think they should be suppressed. Ashok, could you confirm it?
> 
> I will figure out how to suppress them in Xen.

Well, afaik suppressing #MC is impossible. CR4.MCE clear leads
to immediate shutdown in case of a machine check, unless I'm
mistaken. The watchdog NMI, otoh, could of course be "properly"
disabled.

Jan
Ashok Raj June 11, 2019, 4:04 p.m. UTC | #6
On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
> >
> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
> >>  
> >>      mc_intel = patch->mc_intel;
> >>  
> >> -    /* serialize access to the physical write to MSR 0x79 */
> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
> >> +    BUG_ON(local_irq_is_enabled());
> >>  
> >>      /*
> >>       * Writeback and invalidate caches before updating microcode to avoid
> >
> >Thinking about it - what happens if we hit an NMI or #MC here?
> >watchdog_disable(), a call to which you add in an earlier patch,
> >doesn't really suppress the generation of NMIs, it only tells the
> >handler not to look at the accumulated statistics.
> 
> I think they should be suppressed. Ashok, could you confirm it?

I think the only source would be the watchdog, as you pointed out,
which we already touch to keep it from expiring. I'm not an expert in
the perf counters, but I'll check. When we are in a stop_machine() type
flow, it's not clear whether any of those would fire. (I might be wrong,
but let me check.)

#MC shouldn't fire once you have entered the rendezvous; if it does, it's
usually fatal, like a 3-strike, so the machine is going down at that time
and there is nothing we could do to handle it.

Cheers,
Ashok
Jan Beulich June 12, 2019, 7:38 a.m. UTC | #7
>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>> >
>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> *patch)
>> >>  
>> >>      mc_intel = patch->mc_intel;
>> >>  
>> >> -    /* serialize access to the physical write to MSR 0x79 */
>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> >> +    BUG_ON(local_irq_is_enabled());
>> >>  
>> >>      /*
>> >>       * Writeback and invalidate caches before updating microcode to avoid
>> >
>> >Thinking about it - what happens if we hit an NMI or #MC here?
>> >watchdog_disable(), a call to which you add in an earlier patch,
>> >doesn't really suppress the generation of NMIs, it only tells the
>> >handler not to look at the accumulated statistics.
>> 
>> I think they should be suppressed. Ashok, could you confirm it?
> 
> I think the only sources would be the watchdog as you pointed out
> which we already touch to keep it from expiring. The perf counters
> i'm not an expert in, but i'll check. When we are in stop_machine() type
> flow, its not clear if any of those would fire. (I might be wrong, but let
> me check).

Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
how would it _not_ potentially fire?

> #MC shouldn't fire once you entered the rendezvous, if it does its usually
> fatal like a 3strike or something. Machine is going down at that time
> so nothing we could do to handle.

Right - as long as we assume that #MC would be fatal anyway,
there's no point in thinking about ways to suppress it. I guess
it is fatal (almost) always right now, but I don't think it ought to
be. It's just that no-one has the time and environment to make
it actually behave better.

Jan
Chao Gao June 13, 2019, 2:05 p.m. UTC | #8
On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>> >
>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>> *patch)
>>> >>  
>>> >>      mc_intel = patch->mc_intel;
>>> >>  
>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>> >> +    BUG_ON(local_irq_is_enabled());
>>> >>  
>>> >>      /*
>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>> >
>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>> >doesn't really suppress the generation of NMIs, it only tells the
>>> >handler not to look at the accumulated statistics.
>>> 
>>> I think they should be suppressed. Ashok, could you confirm it?
>> 
>> I think the only sources would be the watchdog as you pointed out
>> which we already touch to keep it from expiring. The perf counters
>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>> flow, its not clear if any of those would fire. (I might be wrong, but let
>> me check).
>
>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>how would it _not_ potentially fire?

We plan not to prevent the NMI from being fired. Instead, if one thread of
a core is updating microcode, the other threads of that core would wait in
the NMI handler until the update completes. Is this approach acceptable?

Thanks
Chao
Jan Beulich June 13, 2019, 2:08 p.m. UTC | #9
>>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
> On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>> >
>>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>>> *patch)
>>>> >>  
>>>> >>      mc_intel = patch->mc_intel;
>>>> >>  
>>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>>> >> +    BUG_ON(local_irq_is_enabled());
>>>> >>  
>>>> >>      /*
>>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>>> >
>>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>>> >doesn't really suppress the generation of NMIs, it only tells the
>>>> >handler not to look at the accumulated statistics.
>>>> 
>>>> I think they should be suppressed. Ashok, could you confirm it?
>>> 
>>> I think the only sources would be the watchdog as you pointed out
>>> which we already touch to keep it from expiring. The perf counters
>>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>>> flow, its not clear if any of those would fire. (I might be wrong, but let
>>> me check).
>>
>>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>>how would it _not_ potentially fire?
> 
> We plan not to prevent NMI being fired. Instead, if one thread of a core
> is updating microcode, other threads of this core would stop in the
> handler of NMI until the update completion. Is this approach acceptable?

Well, I have to return the question: It is you who knows what is or
is not acceptable while an ucode update is in progress. In particular
it obviously matters how much ucode is involved in the delivery of
an NMI (and in allowing the handler to get to the point where you'd
"stop" it).

If the approach you suggest is fine for the NMI case, I'd then wonder
if it couldn't also be used for the #MC one.

Jan
Chao Gao June 13, 2019, 2:58 p.m. UTC | #10
On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
>>>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
>> On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>>>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>>>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>>>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>>>>> >
>>>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
>>>> *patch)
>>>>> >>  
>>>>> >>      mc_intel = patch->mc_intel;
>>>>> >>  
>>>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>>>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>>>>> >> +    BUG_ON(local_irq_is_enabled());
>>>>> >>  
>>>>> >>      /*
>>>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>>>>> >
>>>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>>>>> >watchdog_disable(), a call to which you add in an earlier patch,
>>>>> >doesn't really suppress the generation of NMIs, it only tells the
>>>>> >handler not to look at the accumulated statistics.
>>>>> 
>>>>> I think they should be suppressed. Ashok, could you confirm it?
>>>> 
>>>> I think the only sources would be the watchdog as you pointed out
>>>> which we already touch to keep it from expiring. The perf counters
>>>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>>>> flow, its not clear if any of those would fire. (I might be wrong, but let
>>>> me check).
>>>
>>>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>>>how would it _not_ potentially fire?
>> 
>> We plan not to prevent NMI being fired. Instead, if one thread of a core
>> is updating microcode, other threads of this core would stop in the
>> handler of NMI until the update completion. Is this approach acceptable?
>
>Well, I have to return the question: It is you who knows what is or
>is not acceptable while an ucode update is in progress. In particular
>it obviously matters how much ucode is involved in the delivery of
>an NMI (and in allowing the handler to get to the point where you'd
>"stop" it).
>
>If the approach you suggest is fine for the NMI case,

Yes, it is fine. It is a suggestion from Ashok and what he is working on
in the Linux kernel. I just wanted to make sure you didn't oppose this
approach in Xen (considering that disarming the watchdog NMI might be an
alternative).

>I'd then wonder if it couldn't also be used for the #MC one.

I think there is not much practical value for #MC because we still need to
wait for all threads to call in. But as you and Ashok said, #MC is usually
fatal and the machine goes down anyway.

Thanks
Chao
Ashok Raj June 13, 2019, 5:47 p.m. UTC | #11
On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
> >>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
> > On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
> >>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
> >>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
> >>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
> >>>> >
> >>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> >>> *patch)
> >>>> >>  
> >>>> >>      mc_intel = patch->mc_intel;
> >>>> >>  
> >>>> >> -    /* serialize access to the physical write to MSR 0x79 */
> >>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
> >>>> >> +    BUG_ON(local_irq_is_enabled());
> >>>> >>  
> >>>> >>      /*
> >>>> >>       * Writeback and invalidate caches before updating microcode to avoid
> >>>> >
> >>>> >Thinking about it - what happens if we hit an NMI or #MC here?
> >>>> >watchdog_disable(), a call to which you add in an earlier patch,
> >>>> >doesn't really suppress the generation of NMIs, it only tells the
> >>>> >handler not to look at the accumulated statistics.
> >>>> 
> >>>> I think they should be suppressed. Ashok, could you confirm it?
> >>> 
> >>> I think the only sources would be the watchdog as you pointed out
> >>> which we already touch to keep it from expiring. The perf counters
> >>> i'm not an expert in, but i'll check. When we are in stop_machine() type
> >>> flow, its not clear if any of those would fire. (I might be wrong, but let
> >>> me check).
> >>
> >>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
> >>how would it _not_ potentially fire?
> > 
> > We plan not to prevent NMI being fired. Instead, if one thread of a core
> > is updating microcode, other threads of this core would stop in the
> > handler of NMI until the update completion. Is this approach acceptable?
> 
> Well, I have to return the question: It is you who knows what is or
> is not acceptable while an ucode update is in progress. In particular
> it obviously matters how much ucode is involved in the delivery of
> an NMI (and in allowing the handler to get to the point where you'd
> "stop" it).
> 
> If the approach you suggest is fine for the NMI case, I'd then wonder
> if it couldn't also be used for the #MC one.

Architecturally only one #MC can be active in the system. If a new #MC
condition happens when MCG_STATUS.MCIP is already set, that would cause a
spontaneous shutdown.

If another NMI arrives on the CPU doing the wrmsr, it will be pended in
the local APIC and delivered after the wrmsr returns. The wrmsr flow
can't be interrupted.
Jan Beulich June 14, 2019, 8:58 a.m. UTC | #12
>>> On 13.06.19 at 19:47, <ashok.raj@intel.com> wrote:
> On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
>> >>> On 13.06.19 at 16:05, <chao.gao@intel.com> wrote:
>> > On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
>> >>>>> On 11.06.19 at 18:04, <ashok.raj@intel.com> wrote:
>> >>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>> >>>> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>> >>>> >
>> >>>> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch 
> 
>> >>> *patch)
>> >>>> >>  
>> >>>> >>      mc_intel = patch->mc_intel;
>> >>>> >>  
>> >>>> >> -    /* serialize access to the physical write to MSR 0x79 */
>> >>>> >> -    spin_lock_irqsave(&microcode_update_lock, flags);
>> >>>> >> +    BUG_ON(local_irq_is_enabled());
>> >>>> >>  
>> >>>> >>      /*
>> >>>> >>       * Writeback and invalidate caches before updating microcode to avoid
>> >>>> >
>> >>>> >Thinking about it - what happens if we hit an NMI or #MC here?
>> >>>> >watchdog_disable(), a call to which you add in an earlier patch,
>> >>>> >doesn't really suppress the generation of NMIs, it only tells the
>> >>>> >handler not to look at the accumulated statistics.
>> >>>> 
>> >>>> I think they should be suppressed. Ashok, could you confirm it?
>> >>> 
>> >>> I think the only sources would be the watchdog as you pointed out
>> >>> which we already touch to keep it from expiring. The perf counters
>> >>> i'm not an expert in, but i'll check. When we are in stop_machine() type
>> >>> flow, its not clear if any of those would fire. (I might be wrong, but let
>> >>> me check).
>> >>
>> >>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
>> >>how would it _not_ potentially fire?
>> > 
>> > We plan not to prevent NMI being fired. Instead, if one thread of a core
>> > is updating microcode, other threads of this core would stop in the
>> > handler of NMI until the update completion. Is this approach acceptable?
>> 
>> Well, I have to return the question: It is you who knows what is or
>> is not acceptable while an ucode update is in progress. In particular
>> it obviously matters how much ucode is involved in the delivery of
>> an NMI (and in allowing the handler to get to the point where you'd
>> "stop" it).
>> 
>> If the approach you suggest is fine for the NMI case, I'd then wonder
>> if it couldn't also be used for the #MC one.
> 
> Architecturally only one #MC can be active in the system. If a new #MC 
> condition happens when MCG_STATUS.MCIP is already set, that would cause 
> spontaneous 
> shutdown.

That's understood.

> If another NMI arrives on the CPU doing the wrmsr, it will be pended
> in the lapic and delivered after the wrmsr returns. wrmsr flow
> can't be interrupted. 

Of course.

Neither part of your response is an argument against adding the same
"defense" to the #MC handler, though. While an #MC will likely be fatal
to the system anyway, we should try to avoid making things worse when
we can.

Jan

Patch

diff --git a/xen/arch/x86/microcode_amd.c b/xen/arch/x86/microcode_amd.c
index c819028..b64a58d 100644
--- a/xen/arch/x86/microcode_amd.c
+++ b/xen/arch/x86/microcode_amd.c
@@ -74,9 +74,6 @@  struct mpbhdr {
     uint8_t data[];
 };
 
-/* serialize access to the physical write */
-static DEFINE_SPINLOCK(microcode_update_lock);
-
 /* See comment in start_update() for cases when this routine fails */
 static int collect_cpu_info(struct cpu_signature *csig)
 {
@@ -251,7 +248,6 @@  static enum microcode_match_result compare_patch(
 
 static int apply_microcode(const struct microcode_patch *patch)
 {
-    unsigned long flags;
     uint32_t rev;
     int hw_err;
     unsigned int cpu = smp_processor_id();
@@ -263,15 +259,13 @@  static int apply_microcode(const struct microcode_patch *patch)
 
     hdr = patch->mc_amd->mpb;
 
-    spin_lock_irqsave(&microcode_update_lock, flags);
+    BUG_ON(local_irq_is_enabled());
 
     hw_err = wrmsr_safe(MSR_AMD_PATCHLOADER, (unsigned long)hdr);
 
     /* get patch id after patching */
     rdmsrl(MSR_AMD_PATCHLEVEL, rev);
 
-    spin_unlock_irqrestore(&microcode_update_lock, flags);
-
     /*
      * Some processors leave the ucode blob mapping as UC after the update.
      * Flush the mapping to regain normal cacheability.
diff --git a/xen/arch/x86/microcode_intel.c b/xen/arch/x86/microcode_intel.c
index bfb48ce..94a1561 100644
--- a/xen/arch/x86/microcode_intel.c
+++ b/xen/arch/x86/microcode_intel.c
@@ -93,9 +93,6 @@  struct extended_sigtable {
 
 #define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE)
 
-/* serialize access to the physical write to MSR 0x79 */
-static DEFINE_SPINLOCK(microcode_update_lock);
-
 static int collect_cpu_info(struct cpu_signature *csig)
 {
     unsigned int cpu_num = smp_processor_id();
@@ -295,7 +292,6 @@  static struct microcode_patch *allow_microcode_patch(
 
 static int apply_microcode(const struct microcode_patch *patch)
 {
-    unsigned long flags;
     uint64_t msr_content;
     unsigned int val[2];
     unsigned int cpu_num = raw_smp_processor_id();
@@ -307,8 +303,7 @@  static int apply_microcode(const struct microcode_patch *patch)
 
     mc_intel = patch->mc_intel;
 
-    /* serialize access to the physical write to MSR 0x79 */
-    spin_lock_irqsave(&microcode_update_lock, flags);
+    BUG_ON(local_irq_is_enabled());
 
     /*
      * Writeback and invalidate caches before updating microcode to avoid
@@ -327,7 +322,6 @@  static int apply_microcode(const struct microcode_patch *patch)
     rdmsrl(MSR_IA32_UCODE_REV, msr_content);
     val[1] = (uint32_t)(msr_content >> 32);
 
-    spin_unlock_irqrestore(&microcode_update_lock, flags);
     if ( val[1] != mc_intel->hdr.rev )
     {
         printk(KERN_ERR "microcode: CPU%d update from revision "