x86/ucode: Refresh raw CPU policy after microcode load

Message ID	20230503185813.3050382-1-andrew.cooper3@citrix.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> IronPort-Data: A9a23:8svK163+VERbbQF1EPbD5atxkn2cJEfYwER7XKvMYLTBsI5bpzMFx mAbWDvTaPePazb1L951OYS2pk8Gvp6Hx9I3Hlc4pC1hF35El5HIVI+TRqvS04F+DeWYFR46s J9OAjXkBJppJpMJjk71atANlVEliefTAOK6ULWeUsxIbVcMYD87jh5+kPIOjIdtgNyoayuAo tq3qMDEULOf82cc3lk8tuTS+XuDgNyo4GlD5gFmPqgQ1LPjvyJ94Kw3dPnZw0TQGuG4LsbiL 87fwbew+H/u/htFIrtJRZ6iLyXm6paLVeS/oiI+t5qK23CulQRrukoPD9IOaF8/ttm8t4sZJ OOhF3CHYVxB0qXkwIzxWvTDes10FfUuFLTveRBTvSEPpqFvnrSFL/hGVSkL0YMkFulfDGhBx P40JyI0X1OGl9Ob7L/mb+J+mZF2RCXrFNt3VnBIyDjYCbAtQIzZQrWM7thdtNsyrpkQR7CEP ZNfMGcxKk2aOHWjOX9OYH46tM6uimPybHtzr1WNqLBsy2PS0BZwwP7mN9+9ltmiHJwNxB7E+ DyYl4j/KjQBJMSazzPYyy2Pr8PMrA/cdZ0dJqLto5aGh3XMnzdOWXX6T2CTvv2RmkO4HdVFJ CQ86ico6KQ/6kGvZt38RAGj5m6JuAYGXNhdGPF87xuCooL2yQuEAmkPThZadccr8sQxQFQXO kShxo2zQ2Y16fvMFCzbr+3Pxd+vBcQLBWILah4GYQQX2uigpZECoz7CE/NoArHg27UZBgrML yC2QDkW3utD1pZSjfXkojgrkBr3+MGXE1ddChH/Gzv8s1gnPNPNi5mAswCz0BpWEGqOorBtV lAgktPW0u0BBIrleMelELRUR+HBCxpo3VThbb9T83oJrW7FF4aLJ9w43d2EGG9nM9wfZRjia 1LJtAVa6fd7ZSX6NvcqON/oUZ5zncAM8OjYug38NIISMvCdiifelM2RWaJg9z+0yxV9+U3OE ZyabdytHR4nNEiT9xLvH711+eZylkgDKZb7GciT5w65yoCXeHP9Ye5DaDNimMhltvLbyOgUm v4DX/a3J+J3CbCnPnWKqtZMdTjn7xETXPjLliCeTcbbSiIOJY3rI6W5LW8JE2C9o5loqw== IronPort-HdrOrdr: A9a23:5hNiYKgNQUymdnn2984wpRfx6XBQXuQji2hC6mlwRA09TyX4ra GTdZEgvnXJYVkqKRIdcLy7VZVoOEmsk6KdgrN+AV7BZmXbUQKTRelfBeGL+UyYJ8SUzIFgPM lbE5SXBLXLfDpHZcuT2njeLz4rqOP3lZxBio/lvhNQcT0= From: Andrew Cooper <andrew.cooper3@citrix.com> To: Xen-devel <xen-devel@lists.xenproject.org> CC: Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <JBeulich@suse.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Wei Liu <wl@xen.org> Subject: [PATCH] x86/ucode: Refresh raw CPU policy after microcode load Date: Wed, 3 May 2023 19:58:13 +0100 Message-ID: <20230503185813.3050382-1-andrew.cooper3@citrix.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit
Series	x86/ucode: Refresh raw CPU policy after microcode load \| expand x86/ucode: Refresh raw CPU policy after microcode load

Andrew Cooper May 3, 2023, 6:58 p.m. UTC

Loading microcode can cause new features to appear.  This has happened
routinely since Spectre/Meltdown, and even the presence of new status bits can
mean the administrator has no further work to perform.

Refresh the raw CPU policy after late microcode load, so xen-cpuid can reflect
the updated state of the system.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

This is also the first step of being able to livepatch support for new
functionality in microcode.
---
 xen/arch/x86/cpu-policy.c             | 6 +++---
 xen/arch/x86/cpu/microcode/core.c     | 4 ++++
 xen/arch/x86/include/asm/cpu-policy.h | 6 ++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

Jan Beulich May 4, 2023, 7:08 a.m. UTC | #1

On 03.05.2023 20:58, Andrew Cooper wrote:
> Loading microcode can cause new features to appear.

Or disappear (LWP)? While I don't think we want to panic() in this
case (we do on the S3 resume path when recheck_cpu_features() fails
on the BSP), it would seem to me that we want to go a step further
than you do and at least warn when a feature went amiss. Or is that
too much of a scope-creeping request right here?

> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>          spin_lock(&microcode_mutex);
>          microcode_update_cache(patch);
>          spin_unlock(&microcode_mutex);
> +
> +        /* Refresh the raw CPU policy, in case the features have changed. */
> +        calculate_raw_cpu_policy();

I understand this is in line with what we do during boot, but there
and here I wonder whether this wouldn't better deal with possible
asymmetries (e.g. in case ucode loading failed on one of the CPUs),
along the lines of what we do near the end of identify_cpu() for
APs. (Unlike the question higher up, this is definitely only a
remark here, not something I'd consider dealing with right in this
change.)

Jan

Andrew Cooper May 4, 2023, 8:08 a.m. UTC | #2

On 04/05/2023 8:08 am, Jan Beulich wrote:
> On 03.05.2023 20:58, Andrew Cooper wrote:
>> Loading microcode can cause new features to appear.
> Or disappear (LWP)? While I don't think we want to panic() in this
> case (we do on the S3 resume path when recheck_cpu_features() fails
> on the BSP), it would seem to me that we want to go a step further
> than you do and at least warn when a feature went amiss. Or is that
> too much of a scope-creeping request right here?

You're correct that I ought to discuss the disappear case.  But like
livepatching, it's firmly in the realm of "the end user takes
responsibility for trying this in their test system before running it in
production".

For LWP specifically, we ought to explicitly permit its disappearance in
recheck_cpu_features(), because this specific example is known to exist,
and known safe as Xen never used or virtualised LWP functionality. 
Crashing on S3

>
>> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>>          spin_lock(&microcode_mutex);
>>          microcode_update_cache(patch);
>>          spin_unlock(&microcode_mutex);
>> +
>> +        /* Refresh the raw CPU policy, in case the features have changed. */
>> +        calculate_raw_cpu_policy();
> I understand this is in line with what we do during boot, but there
> and here I wonder whether this wouldn't better deal with possible
> asymmetries (e.g. in case ucode loading failed on one of the CPUs),
> along the lines of what we do near the end of identify_cpu() for
> APs. (Unlike the question higher up, this is definitely only a
> remark here, not something I'd consider dealing with right in this
> change.)

Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
principle, but Xen has no way of letting you explicitly get into that
situation.

This too falls firmly into the "end user takes responsibility for
testing it properly first" category.

We have explicit symmetric assumptions/requirements elsewhere (e.g. for
a single system, there's 1 correct ucode blob).

We can acknowledge that asymmetry exists, but there is basically nothing
Xen can do about it other than highlight that something is very wrong on
the system.  Odds are that a system which gets into such a state won't
survive much longer.

~Andrew

Jan Beulich May 4, 2023, 8:17 a.m. UTC | #3

On 04.05.2023 10:08, Andrew Cooper wrote:
> On 04/05/2023 8:08 am, Jan Beulich wrote:
>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>> Loading microcode can cause new features to appear.
>> Or disappear (LWP)? While I don't think we want to panic() in this
>> case (we do on the S3 resume path when recheck_cpu_features() fails
>> on the BSP), it would seem to me that we want to go a step further
>> than you do and at least warn when a feature went amiss. Or is that
>> too much of a scope-creeping request right here?
> 
> You're correct that I ought to discuss the disappear case.  But like
> livepatching, it's firmly in the realm of "the end user takes
> responsibility for trying this in their test system before running it in
> production".

Okay, with the case at least suitably mentioned
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> For LWP specifically, we ought to explicitly permit its disappearance in
> recheck_cpu_features(), because this specific example is known to exist,
> and known safe as Xen never used or virtualised LWP functionality. 

Right, but iirc we did expose it to guests earlier on. And I've used it as
a known example only anyway. Who knows what vendors decide to make disappear
the next time round ...

> Crashing on S3

I may guess you meant to continue "... is bad anyway"?

>>> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>>>          spin_lock(&microcode_mutex);
>>>          microcode_update_cache(patch);
>>>          spin_unlock(&microcode_mutex);
>>> +
>>> +        /* Refresh the raw CPU policy, in case the features have changed. */
>>> +        calculate_raw_cpu_policy();
>> I understand this is in line with what we do during boot, but there
>> and here I wonder whether this wouldn't better deal with possible
>> asymmetries (e.g. in case ucode loading failed on one of the CPUs),
>> along the lines of what we do near the end of identify_cpu() for
>> APs. (Unlike the question higher up, this is definitely only a
>> remark here, not something I'd consider dealing with right in this
>> change.)
> 
> Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
> principle, but Xen has no way of letting you explicitly get into that
> situation.
> 
> This too falls firmly into the "end user takes responsibility for
> testing it properly first" category.
> 
> We have explicit symmetric assumptions/requirements elsewhere (e.g. for
> a single system, there's 1 correct ucode blob).
> 
> We can acknowledge that asymmetry exists, but there is basically nothing
> Xen can do about it other than highlight that something is very wrong on
> the system.  Odds are that a system which gets into such a state won't
> survive much longer.

Indeed. Hence the desire to at least log the fact, such that investigation
of the sudden death won't take long.

Jan

Roger Pau Monne May 4, 2023, 8:22 a.m. UTC | #4

On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
> On 04/05/2023 8:08 am, Jan Beulich wrote:
> > On 03.05.2023 20:58, Andrew Cooper wrote:
> >> Loading microcode can cause new features to appear.
> > Or disappear (LWP)? While I don't think we want to panic() in this
> > case (we do on the S3 resume path when recheck_cpu_features() fails
> > on the BSP), it would seem to me that we want to go a step further
> > than you do and at least warn when a feature went amiss. Or is that
> > too much of a scope-creeping request right here?
> 
> You're correct that I ought to discuss the disappear case.  But like
> livepatching, it's firmly in the realm of "the end user takes
> responsibility for trying this in their test system before running it in
> production".
> 
> For LWP specifically, we ought to explicitly permit its disappearance in
> recheck_cpu_features(), because this specific example is known to exist,
> and known safe as Xen never used or virtualised LWP functionality. 
> Crashing on S3

Printing disappearing features might be a nice bonus, but could be
done from the tool that loads the microcode itself, no need to do such
work in the hypervisor IMO.

> >
> >> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
> >>          spin_lock(&microcode_mutex);
> >>          microcode_update_cache(patch);
> >>          spin_unlock(&microcode_mutex);
> >> +
> >> +        /* Refresh the raw CPU policy, in case the features have changed. */
> >> +        calculate_raw_cpu_policy();
> > I understand this is in line with what we do during boot, but there
> > and here I wonder whether this wouldn't better deal with possible
> > asymmetries (e.g. in case ucode loading failed on one of the CPUs),
> > along the lines of what we do near the end of identify_cpu() for
> > APs. (Unlike the question higher up, this is definitely only a
> > remark here, not something I'd consider dealing with right in this
> > change.)
> 
> Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
> principle, but Xen has no way of letting you explicitly get into that
> situation.
> 
> This too falls firmly into the "end user takes responsibility for
> testing it properly first" category.
> 
> We have explicit symmetric assumptions/requirements elsewhere (e.g. for
> a single system, there's 1 correct ucode blob).
> 
> We can acknowledge that asymmetry exists, but there is basically nothing
> Xen can do about it other than highlight that something is very wrong on
> the system.  Odds are that a system which gets into such a state won't
> survive much longer.

Would it make sense to only update the CPU policy if updated ==
nr_cores?

Thanks, Roger.

Jan Beulich May 4, 2023, 9:07 a.m. UTC | #5

On 04.05.2023 10:22, Roger Pau Monné wrote:
> On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>> Loading microcode can cause new features to appear.
>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>> on the BSP), it would seem to me that we want to go a step further
>>> than you do and at least warn when a feature went amiss. Or is that
>>> too much of a scope-creeping request right here?
>>
>> You're correct that I ought to discuss the disappear case.  But like
>> livepatching, it's firmly in the realm of "the end user takes
>> responsibility for trying this in their test system before running it in
>> production".
>>
>> For LWP specifically, we ought to explicitly permit its disappearance in
>> recheck_cpu_features(), because this specific example is known to exist,
>> and known safe as Xen never used or virtualised LWP functionality. 
>> Crashing on S3
> 
> Printing disappearing features might be a nice bonus, but could be
> done from the tool that loads the microcode itself, no need to do such
> work in the hypervisor IMO.

Except that the system may not last long enough for the (or any) tool
to actually make it to producing such output, let alone any human
actually observing it (when that output isn't necessarily going to
some remote system).

Jan

Andrew Cooper May 4, 2023, 9:28 a.m. UTC | #6

On 04/05/2023 10:07 am, Jan Beulich wrote:
> On 04.05.2023 10:22, Roger Pau Monné wrote:
>> On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
>>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>>> Loading microcode can cause new features to appear.
>>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>>> on the BSP), it would seem to me that we want to go a step further
>>>> than you do and at least warn when a feature went amiss. Or is that
>>>> too much of a scope-creeping request right here?
>>> You're correct that I ought to discuss the disappear case.  But like
>>> livepatching, it's firmly in the realm of "the end user takes
>>> responsibility for trying this in their test system before running it in
>>> production".
>>>
>>> For LWP specifically, we ought to explicitly permit its disappearance in
>>> recheck_cpu_features(), because this specific example is known to exist,
>>> and known safe as Xen never used or virtualised LWP functionality. 
>>> Crashing on S3
>> Printing disappearing features might be a nice bonus, but could be
>> done from the tool that loads the microcode itself, no need to do such
>> work in the hypervisor IMO.

Xen is really the only entity in a position to judge when stuff has gone
away, so this can't really be deferred to another tool.

We have the X86_FEATURE_* names during boot for parsing the
{dom0-}cpuid= command line options, but we don't keep this beyond init

> Except that the system may not last long enough for the (or any) tool
> to actually make it to producing such output, let alone any human
> actually observing it (when that output isn't necessarily going to
> some remote system).

Yeah, `xl dmesg`/serial/whatever is the one place liable for anything
like this to get noticed.

~Andrew

Andrew Cooper May 4, 2023, 10:19 a.m. UTC | #7

On 04/05/2023 9:17 am, Jan Beulich wrote:
> On 04.05.2023 10:08, Andrew Cooper wrote:
>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>> Loading microcode can cause new features to appear.
>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>> on the BSP), it would seem to me that we want to go a step further
>>> than you do and at least warn when a feature went amiss. Or is that
>>> too much of a scope-creeping request right here?
>> You're correct that I ought to discuss the disappear case.  But like
>> livepatching, it's firmly in the realm of "the end user takes
>> responsibility for trying this in their test system before running it in
>> production".
> Okay, with the case at least suitably mentioned
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

>> For LWP specifically, we ought to explicitly permit its disappearance in
>> recheck_cpu_features(), because this specific example is known to exist,
>> and known safe as Xen never used or virtualised LWP functionality. 
> Right, but iirc we did expose it to guests earlier on. And I've used it as
> a known example only anyway. Who knows what vendors decide to make disappear
> the next time round ...

It's true that we used to expose the CPUID bit to guests, but IIRC we
never virtualised it correctly which was my justification for hiding the
feature bit when I was doing the original work to rationalise what
guests saw.

Removing LWP was an extraordinary situation, and AMD didn't do it lightly.

What they did was sacrifice a fairly expensive LWP errata workaround to
make space ucode space for IBPB instead.  They hid the CPUID bit (and
specifically by modifying the CPUID mask MSR only) because there were no
known production users of LWP, and a consequence of the patch, an errata
got unfixed.

On real AMD systems which used to enumerate LWP, and subsequently ceased
to, it was only "normal virt" levels of feature hiding.  The LWP CPUID
leaf still has real data in it, and you can still use %xcr0.lwp, and the
LWP MSRs still function.

Software which was genuinely using LWP, not tickling the errata case,
and late loaded this patch wouldn't actually have anything malfunction
underfoot.

>> Crashing on S3
> I may guess you meant to continue "... is bad anyway"?

Oops.  I was clearly in a rush for my morning meeting.  What I meant was
"we probably shouldn't crash on S3 resume in this known-ok case".

~Andrew

x86/ucode: Refresh raw CPU policy after microcode load

Commit Message

Comments

Patch