diff mbox series

[RFC,for-4.15] x86/msr: introduce an option for legacy MSR behavior selection

Message ID 20210301162357.76527-1-roger.pau@citrix.com (mailing list archive)
State New
Headers show
Series [RFC,for-4.15] x86/msr: introduce an option for legacy MSR behavior selection | expand

Commit Message

Roger Pau Monné March 1, 2021, 4:23 p.m. UTC
Introduce an option to allow selecting the legacy behavior for
accesses to MSRs not explicitly handled. Since commit
84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
handled by Xen result in the injection of a #GP to the guest. This is
a behavior change since previously a #GP was only injected if
accessing the MSR on the real hardware will also trigger a #GP.

This seems to be problematic for some guests, so introduce an option
to fallback to this legacy behavior. The main difference between what
was previously done is that the hardware MSR value is not leaked to
the guests on reads.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Note that this option is not made available to dom0. I'm not sure
whether it makes sense to do so, since anyone updating Xen to such
newer version will also likely pair it with a newish kernel that
doesn't require such workarounds.

RFC because there's still some debate as to how we should solve the
MSR issue, this is one possible way, but IMO we need to make a
decision soon-ish because of the release timeline.

Boris, could you please test with Solaris to see if this fixes the
issue?
---
 docs/man/xl.cfg.5.pod.in          | 11 +++++++++++
 tools/include/libxl.h             |  8 ++++++++
 tools/libs/light/libxl_types.idl  |  2 ++
 tools/libs/light/libxl_x86.c      |  5 +++++
 tools/xl/xl_parse.c               |  3 +++
 xen/arch/x86/domain.c             |  3 +++
 xen/arch/x86/hvm/svm/svm.c        |  9 +++++++++
 xen/arch/x86/hvm/vmx/vmx.c        | 11 +++++++++++
 xen/arch/x86/pv/emul-priv-op.c    | 10 ++++++++++
 xen/include/asm-x86/domain.h      |  3 +++
 xen/include/public/arch-x86/xen.h |  3 +++
 11 files changed, 68 insertions(+)

Comments

Ian Jackson March 1, 2021, 5:16 p.m. UTC | #1
Roger Pau Monne writes ("[PATCH RFC for-4.15] x86/msr: introduce an option for legacy MSR behavior selection"):
> Introduce an option to allow selecting the legacy behavior for
> accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This is
> a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware will also trigger a #GP.
> 
> This seems to be problematic for some guests, so introduce an option
> to fallback to this legacy behavior. The main difference between what
> was previously done is that the hardware MSR value is not leaked to
> the guests on reads.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> ---
> Note that this option is not made available to dom0. I'm not sure
> whether it makes sense to do so, since anyone updating Xen to such
> newer version will also likely pair it with a newish kernel that
> doesn't require such workarounds.
> 
> RFC because there's still some debate as to how we should solve the
> MSR issue, this is one possible way, but IMO we need to make a
> decision soon-ish because of the release timeline.
> 
> Boris, could you please test with Solaris to see if this fixes the
> issue?

So AIUI this patch is to make it possible for Xen 4.15 to behave like
Xen 4.14, thus avoiding a regression for these troublesome guests.

Have we diffed the result of this against 4.14 and if not would it be
a sensible thing to do ?

Ian.
Roger Pau Monné March 1, 2021, 5:52 p.m. UTC | #2
On Mon, Mar 01, 2021 at 05:16:34PM +0000, Ian Jackson wrote:
> Roger Pau Monne writes ("[PATCH RFC for-4.15] x86/msr: introduce an option for legacy MSR behavior selection"):
> > Introduce an option to allow selecting the legacy behavior for
> > accesses to MSRs not explicitly handled. Since commit
> > 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> > handled by Xen result in the injection of a #GP to the guest. This is
> > a behavior change since previously a #GP was only injected if
> > accessing the MSR on the real hardware will also trigger a #GP.
> > 
> > This seems to be problematic for some guests, so introduce an option
> > to fallback to this legacy behavior. The main difference between what
> > was previously done is that the hardware MSR value is not leaked to
> > the guests on reads.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > ---
> > Note that this option is not made available to dom0. I'm not sure
> > whether it makes sense to do so, since anyone updating Xen to such
> > newer version will also likely pair it with a newish kernel that
> > doesn't require such workarounds.
> > 
> > RFC because there's still some debate as to how we should solve the
> > MSR issue, this is one possible way, but IMO we need to make a
> > decision soon-ish because of the release timeline.
> > 
> > Boris, could you please test with Solaris to see if this fixes the
> > issue?
> 
> So AIUI this patch is to make it possible for Xen 4.15 to behave like
> Xen 4.14, thus avoiding a regression for these troublesome guests.

Yes, sorry I haven't provided a release executive summary, as I wasn't
sure this would be acceptable in it's current form. Can do if there's
consensus this is an acceptable fix.

> Have we diffed the result of this against 4.14 and if not would it be
> a sensible thing to do ?

I think there will likely bee too much noise, we have changed the MSR
handling a bit from 4.14, so it's likely a diff to 4.14 is not going
to be helpful as the context will have too many changes (albeit I
haven't tried the exercise myself).

Thanks, Roger.
Ian Jackson March 1, 2021, 5:57 p.m. UTC | #3
Roger Pau Monné writes ("Re: [PATCH RFC for-4.15] x86/msr: introduce an option for legacy MSR behavior selection"):
> On Mon, Mar 01, 2021 at 05:16:34PM +0000, Ian Jackson wrote:
> > So AIUI this patch is to make it possible for Xen 4.15 to behave like
> > Xen 4.14, thus avoiding a regression for these troublesome guests.
> 
> Yes, sorry I haven't provided a release executive summary, as I wasn't
> sure this would be acceptable in it's current form. Can do if there's
> consensus this is an acceptable fix.

Thanks for the information.  I am in favour of fixing this issue.
I have it on my 4.15 blockers list.

Thanks,
Ian.
Boris Ostrovsky March 1, 2021, 7:33 p.m. UTC | #4
On 3/1/21 11:23 AM, Roger Pau Monne wrote:
> Introduce an option to allow selecting the legacy behavior for
> accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This is
> a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware will also trigger a #GP.
>
> This seems to be problematic for some guests, so introduce an option
> to fallback to this legacy behavior. The main difference between what
> was previously done is that the hardware MSR value is not leaked to
> the guests on reads.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> ---
> Note that this option is not made available to dom0. I'm not sure
> whether it makes sense to do so, since anyone updating Xen to such
> newer version will also likely pair it with a newish kernel that
> doesn't require such workarounds.
>
> RFC because there's still some debate as to how we should solve the
> MSR issue, this is one possible way, but IMO we need to make a
> decision soon-ish because of the release timeline.
>
> Boris, could you please test with Solaris to see if this fixes the
> issue?


Yes, it does. Thanks.


-boris
Jan Beulich March 2, 2021, 11:16 a.m. UTC | #5
On 01.03.2021 17:23, Roger Pau Monne wrote:
> Introduce an option to allow selecting the legacy behavior for
> accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This is
> a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware will also trigger a #GP.
> 
> This seems to be problematic for some guests, so introduce an option
> to fallback to this legacy behavior. The main difference between what
> was previously done is that the hardware MSR value is not leaked to
> the guests on reads.

Looking at the WRMSR behavior for PV, what you introduce isn't
matching 4.14 behavior: If rdmsr_safe() failed, all that effected
was the issuing of a log message. The behavior you propose is
better, no question, but it shouldn't be described as matching
legacy behavior then.

Somewhat related to this I wonder whether MSR reads and writes
wouldn't better be controllable independently. It seems quite
likely that a kernel may have an issue only with reads.

Additionally I wonder whether it is a good idea to let these
events go silently.

> Note that this option is not made available to dom0. I'm not sure
> whether it makes sense to do so, since anyone updating Xen to such
> newer version will also likely pair it with a newish kernel that
> doesn't require such workarounds.

No, that's not an option imo. It was a Dom0 where I ran into the
problem causing me to submit "x86/PV: conditionally avoid raising
#GP for early guest MSR accesses". While I could upgrade that
system, I have reasons for not doing so. And while I could put a
more modern kernel on there, I'd prefer if the distro kernel
continued to work. (That's leaving aside that for unrelated
reasons building and using my own, newer kernel there is quite a
bit more difficult than on all other of my test systems.)

> RFC because there's still some debate as to how we should solve the
> MSR issue, this is one possible way, but IMO we need to make a
> decision soon-ish because of the release timeline.

Generally I think it would be far better from a user pov if
things worked out of the box, at least in cases where it can be
made work. Arguably for Solaris this isn't going to be possible
(leaving aside the non-option of fully returning back to original
behavior), but for the old-Linux-PV case I still think my proposed
change is better in this regard. I realize Andrew said (on irc)
that this is making the behavior even weaker. I take a different
perspective: Considering a guest will install exception handlers
at some point, I continue to fail to see what good it will do to
try to inject a #GP(0) when we know the guest will die because of
not having a handler registered just yet. It is a kernel flaw,
yes, but they ended up living with it merely because of our odd
prior behavior, so we can't put all the blame on them.

This isn't to say that I'm against propagating exceptions where
there's no alternative to delivering one. Also I'm certainly open
to further tighten the condition of when to zap such an exception
(e.g. only as long as there's no handler _and_ the guest is still
in UP mode, assuming of course this would still address the
observed issue).

> --- a/docs/man/xl.cfg.5.pod.in
> +++ b/docs/man/xl.cfg.5.pod.in
> @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
>  
>  =back
>  
> +=item B<msr_legacy_handling=BOOLEAN>
> +
> +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
> +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
> +mode will force Xen to replicate the behaviour from the hardware it's currently
> +running on in order to decide whether a #GP is injected to the guest.  Note
> +that the guest is never allowed access to unhandled MSRs, this option only
> +changes whether a #GP might be injected or not.

This description is appropriate for reads, but not for writes:
Whether a write would fault can only be known by trying a write,
yet for safety reasons we can't risk doing more than a read. An
option might be to make writes fault is the to be written value
differs from that which the probing read has returned (i.e. the
same condition which originally had caused a log message to appear
in 4.14 for PV guests).

> --- a/tools/libs/light/libxl_types.idl
> +++ b/tools/libs/light/libxl_types.idl
> @@ -644,6 +644,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
>      ("arch_arm", Struct(None, [("gic_version", libxl_gic_version),
>                                 ("vuart", libxl_vuart_type),
>                                ])),
> +    ("arch_x86", Struct(None, [("msr_legacy_handling", libxl_defbool),
> +                              ])),

Seeing this I'm wondering whether the entire set of arch_*
shouldn't be within a union. But afaics this would have further
implications on code elsewhere, so surely wouldn't want doing
right now.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -852,6 +852,9 @@ int arch_domain_create(struct domain *d,
>  
>      domain_cpu_policy_changed(d);
>  
> +    d->arch.msr_legacy_handling = config->arch.domain_flags &
> +                                  XEN_X86_LEGACY_MSR_HANDLING;

Somewhere you'd also need to refuse processing requests with any
of the other 31 bits set.

Jan
Roger Pau Monné March 2, 2021, 3 p.m. UTC | #6
On Tue, Mar 02, 2021 at 12:16:12PM +0100, Jan Beulich wrote:
> On 01.03.2021 17:23, Roger Pau Monne wrote:
> > Introduce an option to allow selecting the legacy behavior for
> > accesses to MSRs not explicitly handled. Since commit
> > 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> > handled by Xen result in the injection of a #GP to the guest. This is
> > a behavior change since previously a #GP was only injected if
> > accessing the MSR on the real hardware will also trigger a #GP.
> > 
> > This seems to be problematic for some guests, so introduce an option
> > to fallback to this legacy behavior. The main difference between what
> > was previously done is that the hardware MSR value is not leaked to
> > the guests on reads.
> 
> Looking at the WRMSR behavior for PV, what you introduce isn't
> matching 4.14 behavior: If rdmsr_safe() failed, all that effected
> was the issuing of a log message. The behavior you propose is
> better, no question, but it shouldn't be described as matching
> legacy behavior then.
> 
> Somewhat related to this I wonder whether MSR reads and writes
> wouldn't better be controllable independently. It seems quite
> likely that a kernel may have an issue only with reads.
> 
> Additionally I wonder whether it is a good idea to let these
> events go silently.
> 
> > Note that this option is not made available to dom0. I'm not sure
> > whether it makes sense to do so, since anyone updating Xen to such
> > newer version will also likely pair it with a newish kernel that
> > doesn't require such workarounds.
> 
> No, that's not an option imo. It was a Dom0 where I ran into the
> problem causing me to submit "x86/PV: conditionally avoid raising
> #GP for early guest MSR accesses". While I could upgrade that
> system, I have reasons for not doing so. And while I could put a
> more modern kernel on there, I'd prefer if the distro kernel
> continued to work. (That's leaving aside that for unrelated
> reasons building and using my own, newer kernel there is quite a
> bit more difficult than on all other of my test systems.)
> 
> > RFC because there's still some debate as to how we should solve the
> > MSR issue, this is one possible way, but IMO we need to make a
> > decision soon-ish because of the release timeline.
> 
> Generally I think it would be far better from a user pov if
> things worked out of the box, at least in cases where it can be
> made work. Arguably for Solaris this isn't going to be possible
> (leaving aside the non-option of fully returning back to original
> behavior), but for the old-Linux-PV case I still think my proposed
> change is better in this regard. I realize Andrew said (on irc)
> that this is making the behavior even weaker. I take a different
> perspective: Considering a guest will install exception handlers
> at some point, I continue to fail to see what good it will do to
> try to inject a #GP(0) when we know the guest will die because of
> not having a handler registered just yet. It is a kernel flaw,
> yes, but they ended up living with it merely because of our odd
> prior behavior, so we can't put all the blame on them.

My concern with this would mostly be with newer guests, in the sense
that people could come to rely on this behavior of not injecting a
#GP if the handler is not setup, which I think we should try to avoid.

One option would be to introduce a feature that could be used in the
elfnotes to signal that the kernel doesn't require such workarounds
for MSR #GP handling, maybe:

/*
 * Signal that the kernel doesn't require suppressing the #GP on
 * unknown MSR accesses if the handler is not setup. New kernels
 * should always have this set.
 */
#define XENFEAT_msr_early_gp   16

We could try to backport this on the Linux kernel as far as we can
that we know it's safe to do so.

If this is not acceptable then I guess your solution is fine. Arguably
PV has it's own (weird) architecture, in which it might be safe to say
that no #GP will be injected as a result of a MSR access unless the
handler is setup. Ideally we should document this behaviour somewhere.

Maybe we could add a rdmsr_safe to your path akin to what's done
here?

> 
> This isn't to say that I'm against propagating exceptions where
> there's no alternative to delivering one. Also I'm certainly open
> to further tighten the condition of when to zap such an exception
> (e.g. only as long as there's no handler _and_ the guest is still
> in UP mode, assuming of course this would still address the
> observed issue).
> 
> > --- a/docs/man/xl.cfg.5.pod.in
> > +++ b/docs/man/xl.cfg.5.pod.in
> > @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
> >  
> >  =back
> >  
> > +=item B<msr_legacy_handling=BOOLEAN>
> > +
> > +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
> > +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
> > +mode will force Xen to replicate the behaviour from the hardware it's currently
> > +running on in order to decide whether a #GP is injected to the guest.  Note
> > +that the guest is never allowed access to unhandled MSRs, this option only
> > +changes whether a #GP might be injected or not.
> 
> This description is appropriate for reads, but not for writes:
> Whether a write would fault can only be known by trying a write,
> yet for safety reasons we can't risk doing more than a read. An
> option might be to make writes fault is the to be written value
> differs from that which the probing read has returned (i.e. the
> same condition which originally had caused a log message to appear
> in 4.14 for PV guests).

Read values for unhandled MSRs will always be 0 with the proposed
code, so we would inject a #GP to the guest for any write attempt to
unhandled MSRs with a value different than 0.

Maybe we should just inject a #GP to the guest for any attempts to
write to unhandled MSRs?

> > --- a/tools/libs/light/libxl_types.idl
> > +++ b/tools/libs/light/libxl_types.idl
> > @@ -644,6 +644,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
> >      ("arch_arm", Struct(None, [("gic_version", libxl_gic_version),
> >                                 ("vuart", libxl_vuart_type),
> >                                ])),
> > +    ("arch_x86", Struct(None, [("msr_legacy_handling", libxl_defbool),
> > +                              ])),
> 
> Seeing this I'm wondering whether the entire set of arch_*
> shouldn't be within a union. But afaics this would have further
> implications on code elsewhere, so surely wouldn't want doing
> right now.

Right, I thought the same but I'm not sure we can change this now, as
it's part of the API. Adding new fields is fine, but adding a union
field with arch_arm would change the structure as we would need to add
a 'u' (or equivalent) field here.

> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -852,6 +852,9 @@ int arch_domain_create(struct domain *d,
> >  
> >      domain_cpu_policy_changed(d);
> >  
> > +    d->arch.msr_legacy_handling = config->arch.domain_flags &
> > +                                  XEN_X86_LEGACY_MSR_HANDLING;
> 
> Somewhere you'd also need to refuse processing requests with any
> of the other 31 bits set.

Yes, I need to sanitize the flags.

Thanks, Roger.
Jan Beulich March 2, 2021, 3:18 p.m. UTC | #7
On 02.03.2021 16:00, Roger Pau Monné wrote:
> On Tue, Mar 02, 2021 at 12:16:12PM +0100, Jan Beulich wrote:
>> On 01.03.2021 17:23, Roger Pau Monne wrote:
>>> RFC because there's still some debate as to how we should solve the
>>> MSR issue, this is one possible way, but IMO we need to make a
>>> decision soon-ish because of the release timeline.
>>
>> Generally I think it would be far better from a user pov if
>> things worked out of the box, at least in cases where it can be
>> made work. Arguably for Solaris this isn't going to be possible
>> (leaving aside the non-option of fully returning back to original
>> behavior), but for the old-Linux-PV case I still think my proposed
>> change is better in this regard. I realize Andrew said (on irc)
>> that this is making the behavior even weaker. I take a different
>> perspective: Considering a guest will install exception handlers
>> at some point, I continue to fail to see what good it will do to
>> try to inject a #GP(0) when we know the guest will die because of
>> not having a handler registered just yet. It is a kernel flaw,
>> yes, but they ended up living with it merely because of our odd
>> prior behavior, so we can't put all the blame on them.
> 
> My concern with this would mostly be with newer guests, in the sense
> that people could come to rely on this behavior of not injecting a
> #GP if the handler is not setup, which I think we should try to avoid.
> 
> One option would be to introduce a feature that could be used in the
> elfnotes to signal that the kernel doesn't require such workarounds
> for MSR #GP handling, maybe:
> 
> /*
>  * Signal that the kernel doesn't require suppressing the #GP on
>  * unknown MSR accesses if the handler is not setup. New kernels
>  * should always have this set.
>  */
> #define XENFEAT_msr_early_gp   16
> 
> We could try to backport this on the Linux kernel as far as we can
> that we know it's safe to do so.

I too did consider something like this. While I'm not opposed, it
effectively requires well-behaved consumers who haven't been well-
behaved in the past.

> If this is not acceptable then I guess your solution is fine. Arguably
> PV has it's own (weird) architecture, in which it might be safe to say
> that no #GP will be injected as a result of a MSR access unless the
> handler is setup. Ideally we should document this behaviour somewhere.
> 
> Maybe we could add a rdmsr_safe to your path akin to what's done
> here?

Probably. Would need trying out on the affected older version,
just like ...

>>> --- a/docs/man/xl.cfg.5.pod.in
>>> +++ b/docs/man/xl.cfg.5.pod.in
>>> @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
>>>  
>>>  =back
>>>  
>>> +=item B<msr_legacy_handling=BOOLEAN>
>>> +
>>> +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
>>> +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
>>> +mode will force Xen to replicate the behaviour from the hardware it's currently
>>> +running on in order to decide whether a #GP is injected to the guest.  Note
>>> +that the guest is never allowed access to unhandled MSRs, this option only
>>> +changes whether a #GP might be injected or not.
>>
>> This description is appropriate for reads, but not for writes:
>> Whether a write would fault can only be known by trying a write,
>> yet for safety reasons we can't risk doing more than a read. An
>> option might be to make writes fault is the to be written value
>> differs from that which the probing read has returned (i.e. the
>> same condition which originally had caused a log message to appear
>> in 4.14 for PV guests).
> 
> Read values for unhandled MSRs will always be 0 with the proposed
> code, so we would inject a #GP to the guest for any write attempt to
> unhandled MSRs with a value different than 0.
> 
> Maybe we should just inject a #GP to the guest for any attempts to
> write to unhandled MSRs?

... doing this would need to. Iirc I did add the write side of the
handling in my patch just for symmetry. I'd prefer handling to be
symmetric, but I can see why we may not want it to be so.

Jan
Roger Pau Monné March 3, 2021, 3:38 p.m. UTC | #8
On Tue, Mar 02, 2021 at 04:18:59PM +0100, Jan Beulich wrote:
> On 02.03.2021 16:00, Roger Pau Monné wrote:
> > On Tue, Mar 02, 2021 at 12:16:12PM +0100, Jan Beulich wrote:
> >> On 01.03.2021 17:23, Roger Pau Monne wrote:
> >>> RFC because there's still some debate as to how we should solve the
> >>> MSR issue, this is one possible way, but IMO we need to make a
> >>> decision soon-ish because of the release timeline.
> >>
> >> Generally I think it would be far better from a user pov if
> >> things worked out of the box, at least in cases where it can be
> >> made work. Arguably for Solaris this isn't going to be possible
> >> (leaving aside the non-option of fully returning back to original
> >> behavior), but for the old-Linux-PV case I still think my proposed
> >> change is better in this regard. I realize Andrew said (on irc)
> >> that this is making the behavior even weaker. I take a different
> >> perspective: Considering a guest will install exception handlers
> >> at some point, I continue to fail to see what good it will do to
> >> try to inject a #GP(0) when we know the guest will die because of
> >> not having a handler registered just yet. It is a kernel flaw,
> >> yes, but they ended up living with it merely because of our odd
> >> prior behavior, so we can't put all the blame on them.
> > 
> > My concern with this would mostly be with newer guests, in the sense
> > that people could come to rely on this behavior of not injecting a
> > #GP if the handler is not setup, which I think we should try to avoid.
> > 
> > One option would be to introduce a feature that could be used in the
> > elfnotes to signal that the kernel doesn't require such workarounds
> > for MSR #GP handling, maybe:
> > 
> > /*
> >  * Signal that the kernel doesn't require suppressing the #GP on
> >  * unknown MSR accesses if the handler is not setup. New kernels
> >  * should always have this set.
> >  */
> > #define XENFEAT_msr_early_gp   16
> > 
> > We could try to backport this on the Linux kernel as far as we can
> > that we know it's safe to do so.
> 
> I too did consider something like this. While I'm not opposed, it
> effectively requires well-behaved consumers who haven't been well-
> behaved in the past.
> 
> > If this is not acceptable then I guess your solution is fine. Arguably
> > PV has it's own (weird) architecture, in which it might be safe to say
> > that no #GP will be injected as a result of a MSR access unless the
> > handler is setup. Ideally we should document this behaviour somewhere.
> > 
> > Maybe we could add a rdmsr_safe to your path akin to what's done
> > here?
> 
> Probably. Would need trying out on the affected older version,
> just like ...
> 
> >>> --- a/docs/man/xl.cfg.5.pod.in
> >>> +++ b/docs/man/xl.cfg.5.pod.in
> >>> @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
> >>>  
> >>>  =back
> >>>  
> >>> +=item B<msr_legacy_handling=BOOLEAN>
> >>> +
> >>> +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
> >>> +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
> >>> +mode will force Xen to replicate the behaviour from the hardware it's currently
> >>> +running on in order to decide whether a #GP is injected to the guest.  Note
> >>> +that the guest is never allowed access to unhandled MSRs, this option only
> >>> +changes whether a #GP might be injected or not.
> >>
> >> This description is appropriate for reads, but not for writes:
> >> Whether a write would fault can only be known by trying a write,
> >> yet for safety reasons we can't risk doing more than a read. An
> >> option might be to make writes fault is the to be written value
> >> differs from that which the probing read has returned (i.e. the
> >> same condition which originally had caused a log message to appear
> >> in 4.14 for PV guests).
> > 
> > Read values for unhandled MSRs will always be 0 with the proposed
> > code, so we would inject a #GP to the guest for any write attempt to
> > unhandled MSRs with a value different than 0.
> > 
> > Maybe we should just inject a #GP to the guest for any attempts to
> > write to unhandled MSRs?
> 
> ... doing this would need to. Iirc I did add the write side of the
> handling in my patch just for symmetry. I'd prefer handling to be
> symmetric, but I can see why we may not want it to be so.

Kind of in the wrong context, but I will reply here because it's the
last message related to the MSR stuff.

Jan: would you be fine with modifying your path to not change the
behaviour for writes (ie: always inject #GP to the guest for unhandled
accesses), and then add a rdmsr_safe to the read path in order to
decide whether to inject a #GP to the guest even if the #GP handler is
not setup?

I can modify the option introduced on this patch to always inject #GP
on unhandled writes and only inject #GP on reads if the physical MSR
access on the hardware also triggers a #GP. HVM guests with broken
behavior will require having the option enabled in order to work,
but I think that's likely OK as long term we expect all HVM guests to
be well behaved.

My main worry with this approach is that we end up requiring half of
the common HVM guests OSes to have the 'legacy MSR handling' option
enabled in order to work. I think it's unlikely for this to happen, as
we are only aware of Solaris requiring it at the moment.

It also raises the question whether we will allow any more exceptions
to the MSR policy, like we did for Windows and OpenBSD in:

http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=ca88a43e660c75796656a544e54a648c60d26ef0
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=4175fd3ccd17face664036fa98e9329aa317f7b6

Or if we are just going to require those guests to enable the legacy
MSR handling instead.

Thanks, Roger.
Jan Beulich March 4, 2021, 8:48 a.m. UTC | #9
On 03.03.2021 16:38, Roger Pau Monné wrote:
> On Tue, Mar 02, 2021 at 04:18:59PM +0100, Jan Beulich wrote:
>> On 02.03.2021 16:00, Roger Pau Monné wrote:
>>> On Tue, Mar 02, 2021 at 12:16:12PM +0100, Jan Beulich wrote:
>>>> On 01.03.2021 17:23, Roger Pau Monne wrote:
>>>>> RFC because there's still some debate as to how we should solve the
>>>>> MSR issue, this is one possible way, but IMO we need to make a
>>>>> decision soon-ish because of the release timeline.
>>>>
>>>> Generally I think it would be far better from a user pov if
>>>> things worked out of the box, at least in cases where it can be
>>>> made work. Arguably for Solaris this isn't going to be possible
>>>> (leaving aside the non-option of fully returning back to original
>>>> behavior), but for the old-Linux-PV case I still think my proposed
>>>> change is better in this regard. I realize Andrew said (on irc)
>>>> that this is making the behavior even weaker. I take a different
>>>> perspective: Considering a guest will install exception handlers
>>>> at some point, I continue to fail to see what good it will do to
>>>> try to inject a #GP(0) when we know the guest will die because of
>>>> not having a handler registered just yet. It is a kernel flaw,
>>>> yes, but they ended up living with it merely because of our odd
>>>> prior behavior, so we can't put all the blame on them.
>>>
>>> My concern with this would mostly be with newer guests, in the sense
>>> that people could come to rely on this behavior of not injecting a
>>> #GP if the handler is not setup, which I think we should try to avoid.
>>>
>>> One option would be to introduce a feature that could be used in the
>>> elfnotes to signal that the kernel doesn't require such workarounds
>>> for MSR #GP handling, maybe:
>>>
>>> /*
>>>  * Signal that the kernel doesn't require suppressing the #GP on
>>>  * unknown MSR accesses if the handler is not setup. New kernels
>>>  * should always have this set.
>>>  */
>>> #define XENFEAT_msr_early_gp   16
>>>
>>> We could try to backport this on the Linux kernel as far as we can
>>> that we know it's safe to do so.
>>
>> I too did consider something like this. While I'm not opposed, it
>> effectively requires well-behaved consumers who haven't been well-
>> behaved in the past.
>>
>>> If this is not acceptable then I guess your solution is fine. Arguably
>>> PV has it's own (weird) architecture, in which it might be safe to say
>>> that no #GP will be injected as a result of a MSR access unless the
>>> handler is setup. Ideally we should document this behaviour somewhere.
>>>
>>> Maybe we could add a rdmsr_safe to your path akin to what's done
>>> here?
>>
>> Probably. Would need trying out on the affected older version,
>> just like ...
>>
>>>>> --- a/docs/man/xl.cfg.5.pod.in
>>>>> +++ b/docs/man/xl.cfg.5.pod.in
>>>>> @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
>>>>>  
>>>>>  =back
>>>>>  
>>>>> +=item B<msr_legacy_handling=BOOLEAN>
>>>>> +
>>>>> +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
>>>>> +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
>>>>> +mode will force Xen to replicate the behaviour from the hardware it's currently
>>>>> +running on in order to decide whether a #GP is injected to the guest.  Note
>>>>> +that the guest is never allowed access to unhandled MSRs, this option only
>>>>> +changes whether a #GP might be injected or not.
>>>>
>>>> This description is appropriate for reads, but not for writes:
>>>> Whether a write would fault can only be known by trying a write,
>>>> yet for safety reasons we can't risk doing more than a read. An
>>>> option might be to make writes fault is the to be written value
>>>> differs from that which the probing read has returned (i.e. the
>>>> same condition which originally had caused a log message to appear
>>>> in 4.14 for PV guests).
>>>
>>> Read values for unhandled MSRs will always be 0 with the proposed
>>> code, so we would inject a #GP to the guest for any write attempt to
>>> unhandled MSRs with a value different than 0.
>>>
>>> Maybe we should just inject a #GP to the guest for any attempts to
>>> write to unhandled MSRs?
>>
>> ... doing this would need to. Iirc I did add the write side of the
>> handling in my patch just for symmetry. I'd prefer handling to be
>> symmetric, but I can see why we may not want it to be so.
> 
> Kind of in the wrong context, but I will reply here because it's the
> last message related to the MSR stuff.
> 
> Jan: would you be fine with modifying your path to not change the
> behaviour for writes (ie: always inject #GP to the guest for unhandled
> accesses), and then add a rdmsr_safe to the read path in order to
> decide whether to inject a #GP to the guest even if the #GP handler is
> not setup?

I don't mind as long as it ends up making things work (I have no
reason to believe it won't). I'll give that a try.

> I can modify the option introduced on this patch to always inject #GP
> on unhandled writes and only inject #GP on reads if the physical MSR
> access on the hardware also triggers a #GP. HVM guests with broken
> behavior will require having the option enabled in order to work,
> but I think that's likely OK as long term we expect all HVM guests to
> be well behaved.
> 
> My main worry with this approach is that we end up requiring half of
> the common HVM guests OSes to have the 'legacy MSR handling' option
> enabled in order to work. I think it's unlikely for this to happen, as
> we are only aware of Solaris requiring it at the moment.
> 
> It also raises the question whether we will allow any more exceptions
> to the MSR policy, like we did for Windows and OpenBSD in:
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=ca88a43e660c75796656a544e54a648c60d26ef0
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=4175fd3ccd17face664036fa98e9329aa317f7b6
> 
> Or if we are just going to require those guests to enable the legacy
> MSR handling instead.

It is my understanding that Andrew's view is that adding such special
cases is the only acceptable thing. As voiced before I don't share
this view, as it means we're going to be entirely reactive to people
reporting issues, when I think we should be pro-active as far as is
reasonable. Independent of any pro-active measures there'll still be
enough reasons for reactive changes - for example I assume Linux
would now issue the log message from

	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {

		if (c->x86 > 0x10 ||
		    (c->x86 == 0x10 && c->x86_model >= 0x2)) {
			u64 val;

			rdmsrl(MSR_K7_HWCR, val);
			if (!(val & BIT(24)))
				pr_warn(FW_BUG "TSC doesn't count with P0 frequency!\n");
		}
	}

since we surface a zero value right now (but I didn't verify this in
practice yet).

Jan
Roger Pau Monné March 4, 2021, 10:05 a.m. UTC | #10
On Thu, Mar 04, 2021 at 09:48:25AM +0100, Jan Beulich wrote:
> On 03.03.2021 16:38, Roger Pau Monné wrote:
> > On Tue, Mar 02, 2021 at 04:18:59PM +0100, Jan Beulich wrote:
> >> On 02.03.2021 16:00, Roger Pau Monné wrote:
> >>> On Tue, Mar 02, 2021 at 12:16:12PM +0100, Jan Beulich wrote:
> >>>> On 01.03.2021 17:23, Roger Pau Monne wrote:
> >>>>> RFC because there's still some debate as to how we should solve the
> >>>>> MSR issue, this is one possible way, but IMO we need to make a
> >>>>> decision soon-ish because of the release timeline.
> >>>>
> >>>> Generally I think it would be far better from a user pov if
> >>>> things worked out of the box, at least in cases where it can be
> >>>> made work. Arguably for Solaris this isn't going to be possible
> >>>> (leaving aside the non-option of fully returning back to original
> >>>> behavior), but for the old-Linux-PV case I still think my proposed
> >>>> change is better in this regard. I realize Andrew said (on irc)
> >>>> that this is making the behavior even weaker. I take a different
> >>>> perspective: Considering a guest will install exception handlers
> >>>> at some point, I continue to fail to see what good it will do to
> >>>> try to inject a #GP(0) when we know the guest will die because of
> >>>> not having a handler registered just yet. It is a kernel flaw,
> >>>> yes, but they ended up living with it merely because of our odd
> >>>> prior behavior, so we can't put all the blame on them.
> >>>
> >>> My concern with this would mostly be with newer guests, in the sense
> >>> that people could come to rely on this behavior of not injecting a
> >>> #GP if the handler is not setup, which I think we should try to avoid.
> >>>
> >>> One option would be to introduce a feature that could be used in the
> >>> elfnotes to signal that the kernel doesn't require such workarounds
> >>> for MSR #GP handling, maybe:
> >>>
> >>> /*
> >>>  * Signal that the kernel doesn't require suppressing the #GP on
> >>>  * unknown MSR accesses if the handler is not setup. New kernels
> >>>  * should always have this set.
> >>>  */
> >>> #define XENFEAT_msr_early_gp   16
> >>>
> >>> We could try to backport this on the Linux kernel as far as we can
> >>> that we know it's safe to do so.
> >>
> >> I too did consider something like this. While I'm not opposed, it
> >> effectively requires well-behaved consumers who haven't been well-
> >> behaved in the past.
> >>
> >>> If this is not acceptable then I guess your solution is fine. Arguably
> >>> PV has it's own (weird) architecture, in which it might be safe to say
> >>> that no #GP will be injected as a result of a MSR access unless the
> >>> handler is setup. Ideally we should document this behaviour somewhere.
> >>>
> >>> Maybe we could add a rdmsr_safe to your path akin to what's done
> >>> here?
> >>
> >> Probably. Would need trying out on the affected older version,
> >> just like ...
> >>
> >>>>> --- a/docs/man/xl.cfg.5.pod.in
> >>>>> +++ b/docs/man/xl.cfg.5.pod.in
> >>>>> @@ -2861,6 +2861,17 @@ No MCA capabilities in above list are enabled.
> >>>>>  
> >>>>>  =back
> >>>>>  
> >>>>> +=item B<msr_legacy_handling=BOOLEAN>
> >>>>> +
> >>>>> +Select whether to use the legacy behaviour for accesses to MSRs not explicitly
> >>>>> +handled by Xen instead of injecting a #GP to the guest.  Such legacy access
> >>>>> +mode will force Xen to replicate the behaviour from the hardware it's currently
> >>>>> +running on in order to decide whether a #GP is injected to the guest.  Note
> >>>>> +that the guest is never allowed access to unhandled MSRs, this option only
> >>>>> +changes whether a #GP might be injected or not.
> >>>>
> >>>> This description is appropriate for reads, but not for writes:
> >>>> Whether a write would fault can only be known by trying a write,
> >>>> yet for safety reasons we can't risk doing more than a read. An
> >>>> option might be to make writes fault is the to be written value
> >>>> differs from that which the probing read has returned (i.e. the
> >>>> same condition which originally had caused a log message to appear
> >>>> in 4.14 for PV guests).
> >>>
> >>> Read values for unhandled MSRs will always be 0 with the proposed
> >>> code, so we would inject a #GP to the guest for any write attempt to
> >>> unhandled MSRs with a value different than 0.
> >>>
> >>> Maybe we should just inject a #GP to the guest for any attempts to
> >>> write to unhandled MSRs?
> >>
> >> ... doing this would need to. Iirc I did add the write side of the
> >> handling in my patch just for symmetry. I'd prefer handling to be
> >> symmetric, but I can see why we may not want it to be so.
> > 
> > Kind of in the wrong context, but I will reply here because it's the
> > last message related to the MSR stuff.
> > 
> > Jan: would you be fine with modifying your path to not change the
> > behaviour for writes (ie: always inject #GP to the guest for unhandled
> > accesses), and then add a rdmsr_safe to the read path in order to
> > decide whether to inject a #GP to the guest even if the #GP handler is
> > not setup?
> 
> I don't mind as long as it ends up making things work (I have no
> reason to believe it won't). I'll give that a try.

Thanks. I think this behavior would be a fine solution for PV guests
because:

 - It's unlikely we will see any new ports to x86 PV mode, and hence
   we mostly need to care about the existing ones which we already
   fixed in new versions, so this behavior is not to be propagated.
 - The Xen PV architecture is already different from the x86
   architecture, and hence we can slightly bend the rules to our
   liking.
 - It's not feasible for us to figure out what MSRs older Linux
   versions tried to access without having a #GP handler setup. Not
   sure whether this also affects NetBSD.

I would like to suggest to introduce a comment to document this
behaviour for x86 PV in the public headers, but I'm afraid I cannot
find a suitable location.

> > I can modify the option introduced on this patch to always inject #GP
> > on unhandled writes and only inject #GP on reads if the physical MSR
> > access on the hardware also triggers a #GP. HVM guests with broken
> > behavior will require having the option enabled in order to work,
> > but I think that's likely OK as long term we expect all HVM guests to
> > be well behaved.
> > 
> > My main worry with this approach is that we end up requiring half of
> > the common HVM guests OSes to have the 'legacy MSR handling' option
> > enabled in order to work. I think it's unlikely for this to happen, as
> > we are only aware of Solaris requiring it at the moment.
> > 
> > It also raises the question whether we will allow any more exceptions
> > to the MSR policy, like we did for Windows and OpenBSD in:
> > 
> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=ca88a43e660c75796656a544e54a648c60d26ef0
> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=4175fd3ccd17face664036fa98e9329aa317f7b6
> > 
> > Or if we are just going to require those guests to enable the legacy
> > MSR handling instead.
> 
> It is my understanding that Andrew's view is that adding such special
> cases is the only acceptable thing. As voiced before I don't share
> this view, as it means we're going to be entirely reactive to people
> reporting issues, when I think we should be pro-active as far as is
> reasonable. Independent of any pro-active measures there'll still be
> enough reasons for reactive changes - for example I assume Linux
> would now issue the log message from
> 
> 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
> 
> 		if (c->x86 > 0x10 ||
> 		    (c->x86 == 0x10 && c->x86_model >= 0x2)) {
> 			u64 val;
> 
> 			rdmsrl(MSR_K7_HWCR, val);
> 			if (!(val & BIT(24)))
> 				pr_warn(FW_BUG "TSC doesn't count with P0 frequency!\n");
> 		}
> 	}
> 
> since we surface a zero value right now (but I didn't verify this in
> practice yet).

I think we inject a #GP to the guest if it tries to access
MSR_K7_HWCR? As I don't see this MSR handled explicitly in
svm_msr_read_intercept. So Linux would complain from the unchecked MSR
access and the MSR value not having the bit set.

This one seems like a fine candidate to implement in
svm_msr_read_intercept, because Xen needs to return a specific value
for this MSR.

Regarding the global approach to fixing the fallout from the MSR
policy change, I don't see why we couldn't do a mix between pro-active
and reactive.

I think we must have an option to fallback to something similar to the
old policy for HVM guests so that users have a way to get their guests
running after an update without requiring a code change.

That doesn't mean we can't reactively add the missing MSRs as we go
discovering them. I would even print a warning when using such
fallback legacy MSR handling option that you need to report the issue
to xen-devel because the option might be removed in future releases.

Does the above seem like a sensible plan?

Roger.
Jan Beulich March 4, 2021, 10:58 a.m. UTC | #11
On 04.03.2021 11:05, Roger Pau Monné wrote:
> On Thu, Mar 04, 2021 at 09:48:25AM +0100, Jan Beulich wrote:
>> On 03.03.2021 16:38, Roger Pau Monné wrote:
>>> It also raises the question whether we will allow any more exceptions
>>> to the MSR policy, like we did for Windows and OpenBSD in:
>>>
>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=ca88a43e660c75796656a544e54a648c60d26ef0
>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=4175fd3ccd17face664036fa98e9329aa317f7b6
>>>
>>> Or if we are just going to require those guests to enable the legacy
>>> MSR handling instead.
>>
>> It is my understanding that Andrew's view is that adding such special
>> cases is the only acceptable thing. As voiced before I don't share
>> this view, as it means we're going to be entirely reactive to people
>> reporting issues, when I think we should be pro-active as far as is
>> reasonable. Independent of any pro-active measures there'll still be
>> enough reasons for reactive changes - for example I assume Linux
>> would now issue the log message from
>>
>> 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
>>
>> 		if (c->x86 > 0x10 ||
>> 		    (c->x86 == 0x10 && c->x86_model >= 0x2)) {
>> 			u64 val;
>>
>> 			rdmsrl(MSR_K7_HWCR, val);
>> 			if (!(val & BIT(24)))
>> 				pr_warn(FW_BUG "TSC doesn't count with P0 frequency!\n");
>> 		}
>> 	}
>>
>> since we surface a zero value right now (but I didn't verify this in
>> practice yet).
> 
> I think we inject a #GP to the guest if it tries to access
> MSR_K7_HWCR? As I don't see this MSR handled explicitly in
> svm_msr_read_intercept. So Linux would complain from the unchecked MSR
> access and the MSR value not having the bit set.

Right - my description was of the behavior with my workaround
already in place.

> This one seems like a fine candidate to implement in
> svm_msr_read_intercept, because Xen needs to return a specific value
> for this MSR.
> 
> Regarding the global approach to fixing the fallout from the MSR
> policy change, I don't see why we couldn't do a mix between pro-active
> and reactive.
> 
> I think we must have an option to fallback to something similar to the
> old policy for HVM guests so that users have a way to get their guests
> running after an update without requiring a code change.
> 
> That doesn't mean we can't reactively add the missing MSRs as we go
> discovering them. I would even print a warning when using such
> fallback legacy MSR handling option that you need to report the issue
> to xen-devel because the option might be removed in future releases.
> 
> Does the above seem like a sensible plan?

I think so, yes. I wonder what Andrew thinks, though.

Jan
Ian Jackson March 4, 2021, 11:24 a.m. UTC | #12
Jan Beulich writes ("Re: [PATCH RFC for-4.15] x86/msr: introduce an option for legacy MSR behavior selection"):
> On 04.03.2021 11:05, Roger Pau Monné wrote:
...
> > This one seems like a fine candidate to implement in
> > svm_msr_read_intercept, because Xen needs to return a specific value
> > for this MSR.
> > 
> > Regarding the global approach to fixing the fallout from the MSR
> > policy change, I don't see why we couldn't do a mix between pro-active
> > and reactive.
> > 
> > I think we must have an option to fallback to something similar to the
> > old policy for HVM guests so that users have a way to get their guests
> > running after an update without requiring a code change.
> > 
> > That doesn't mean we can't reactively add the missing MSRs as we go
> > discovering them. I would even print a warning when using such
> > fallback legacy MSR handling option that you need to report the issue
> > to xen-devel because the option might be removed in future releases.
> > 
> > Does the above seem like a sensible plan?
> 
> I think so, yes. I wonder what Andrew thinks, though.

FTR I am on board with this plan.  I would like to see quick progress
on this issue as it seems like one of the major risks in the release.

Thanks,
Ian.
Andrew Cooper March 4, 2021, 2:02 p.m. UTC | #13
On 01/03/2021 19:33, Boris Ostrovsky wrote:
> On 3/1/21 11:23 AM, Roger Pau Monne wrote:
>> Introduce an option to allow selecting the legacy behavior for
>> accesses to MSRs not explicitly handled. Since commit
>> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
>> handled by Xen result in the injection of a #GP to the guest. This is
>> a behavior change since previously a #GP was only injected if
>> accessing the MSR on the real hardware will also trigger a #GP.
>>
>> This seems to be problematic for some guests, so introduce an option
>> to fallback to this legacy behavior. The main difference between what
>> was previously done is that the hardware MSR value is not leaked to
>> the guests on reads.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> ---
>> Note that this option is not made available to dom0. I'm not sure
>> whether it makes sense to do so, since anyone updating Xen to such
>> newer version will also likely pair it with a newish kernel that
>> doesn't require such workarounds.
>>
>> RFC because there's still some debate as to how we should solve the
>> MSR issue, this is one possible way, but IMO we need to make a
>> decision soon-ish because of the release timeline.
>>
>> Boris, could you please test with Solaris to see if this fixes the
>> issue?
>
> Yes, it does. Thanks.

Really?  This doesn't stop #GP being raised for RAPL_POWER_LIMIT
AFAICT.  Or am I mistaken about how the bug manifested?

~Andrew
Andrew Cooper March 4, 2021, 2:17 p.m. UTC | #14
On 04/03/2021 14:02, Andrew Cooper wrote:
> On 01/03/2021 19:33, Boris Ostrovsky wrote:
>> On 3/1/21 11:23 AM, Roger Pau Monne wrote:
>>> Introduce an option to allow selecting the legacy behavior for
>>> accesses to MSRs not explicitly handled. Since commit
>>> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
>>> handled by Xen result in the injection of a #GP to the guest. This is
>>> a behavior change since previously a #GP was only injected if
>>> accessing the MSR on the real hardware will also trigger a #GP.
>>>
>>> This seems to be problematic for some guests, so introduce an option
>>> to fallback to this legacy behavior. The main difference between what
>>> was previously done is that the hardware MSR value is not leaked to
>>> the guests on reads.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> ---
>>> Note that this option is not made available to dom0. I'm not sure
>>> whether it makes sense to do so, since anyone updating Xen to such
>>> newer version will also likely pair it with a newish kernel that
>>> doesn't require such workarounds.
>>>
>>> RFC because there's still some debate as to how we should solve the
>>> MSR issue, this is one possible way, but IMO we need to make a
>>> decision soon-ish because of the release timeline.
>>>
>>> Boris, could you please test with Solaris to see if this fixes the
>>> issue?
>> Yes, it does. Thanks.
> Really?  This doesn't stop #GP being raised for RAPL_POWER_LIMIT
> AFAICT.  Or am I mistaken about how the bug manifested?

Actually never mind.  I've figured out why.  I have another bugfix to
submit.

~Andrew
diff mbox series

Patch

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 040374dcd6..78dadcdfdd 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -2861,6 +2861,17 @@  No MCA capabilities in above list are enabled.
 
 =back
 
+=item B<msr_legacy_handling=BOOLEAN>
+
+Select whether to use the legacy behaviour for accesses to MSRs not explicitly
+handled by Xen instead of injecting a #GP to the guest.  Such legacy access
+mode will force Xen to replicate the behaviour from the hardware it's currently
+running on in order to decide whether a #GP is injected to the guest.  Note
+that the guest is never allowed access to unhandled MSRs, this option only
+changes whether a #GP might be injected or not.
+
+=back
+
 =back
 
 =head1 SEE ALSO
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index a7b673e89d..3bf6aded97 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -495,6 +495,14 @@ 
  */
 #define LIBXL_HAVE_VMTRACE_BUF_KB 1
 
+/*
+ * LIBXL_HAVE_MSR_LEGACY_HANDLING indicates the toolstack has support for
+ * switching the MSR handling in the hypervisor to legacy mode, where #GP is
+ * only injected to guests for unhandled MSRs if accessing the MSR on the
+ * physical hardware also triggers a #GP.
+ */
+#define LIBXL_HAVE_MSR_LEGACY_HANDLING 1
+
 /*
  * libxl ABI compatibility
  *
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 5b85a7419f..5bb12bc70d 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -644,6 +644,8 @@  libxl_domain_build_info = Struct("domain_build_info",[
     ("arch_arm", Struct(None, [("gic_version", libxl_gic_version),
                                ("vuart", libxl_vuart_type),
                               ])),
+    ("arch_x86", Struct(None, [("msr_legacy_handling", libxl_defbool),
+                              ])),
     # Alternate p2m is not bound to any architecture or guest type, as it is
     # supported by x86 HVM and ARM support is planned.
     ("altp2m", libxl_altp2m_mode),
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 58187ed760..4b2b5d69a6 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -19,6 +19,10 @@  int libxl__arch_domain_prepare_config(libxl__gc *gc,
         abort();
     }
 
+    config->arch.domain_flags = 0;
+    if (libxl_defbool_val(d_config->b_info.arch_x86.msr_legacy_handling))
+        config->arch.domain_flags |= XEN_X86_LEGACY_MSR_HANDLING;
+
     return 0;
 }
 
@@ -809,6 +813,7 @@  void libxl__arch_domain_build_info_setdefault(libxl__gc *gc,
                                               libxl_domain_build_info *b_info)
 {
     libxl_defbool_setdefault(&b_info->acpi, true);
+    libxl_defbool_setdefault(&b_info->arch_x86.msr_legacy_handling, false);
 }
 
 int libxl__arch_passthrough_mode_setdefault(libxl__gc *gc,
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 1893cfc086..fc79b6cc83 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -2741,6 +2741,9 @@  skip_usbdev:
     xlu_cfg_get_defbool(config, "xend_suspend_evtchn_compat",
                         &c_info->xend_suspend_evtchn_compat, 0);
 
+    xlu_cfg_get_defbool(config, "msr_legacy_handling",
+                        &b_info->arch_x86.msr_legacy_handling, 0);
+
     xlu_cfg_destroy(config);
 }
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 6c7ee25f3b..28805d50b8 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -852,6 +852,9 @@  int arch_domain_create(struct domain *d,
 
     domain_cpu_policy_changed(d);
 
+    d->arch.msr_legacy_handling = config->arch.domain_flags &
+                                  XEN_X86_LEGACY_MSR_HANDLING;
+
     return 0;
 
  fail:
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index b819897a4a..b535c5927a 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1795,6 +1795,7 @@  static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
     const struct domain *d = v->domain;
     struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
     const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
+    uint64_t tmp;
 
     switch ( msr )
     {
@@ -1965,6 +1966,11 @@  static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
         break;
 
     default:
+        if ( d->arch.msr_legacy_handling && !rdmsr_safe(msr, tmp) )
+        {
+            *msr_content = 0;
+            break;
+        }
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", msr);
         goto gpf;
     }
@@ -2151,6 +2157,9 @@  static int svm_msr_write_intercept(unsigned int msr, uint64_t msr_content)
         break;
 
     default:
+        if ( d->arch.msr_legacy_handling && !rdmsr_safe(msr, msr_content) )
+            break;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  msr, msr_content);
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index faba95d057..7707ae8cbc 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3114,6 +3114,7 @@  static int is_last_branch_msr(u32 ecx)
 static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
 {
     struct vcpu *curr = current;
+    uint64_t tmp;
 
     HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr);
 
@@ -3195,6 +3196,12 @@  static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
             break;
         }
 
+        if ( curr->domain->arch.msr_legacy_handling && !rdmsr_safe(msr, tmp) )
+        {
+            *msr_content = 0;
+            break;
+        }
+
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", msr);
         goto gp_fault;
     }
@@ -3497,6 +3504,10 @@  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
              is_last_branch_msr(msr) )
             break;
 
+        if ( v->domain->arch.msr_legacy_handling &&
+             !rdmsr_safe(msr, msr_content) )
+            break;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  msr, msr_content);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index e5a22b9347..ecf98c4a05 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -875,6 +875,7 @@  static int read_msr(unsigned int reg, uint64_t *val,
     const struct domain *currd = curr->domain;
     const struct cpuid_policy *cp = currd->arch.cpuid;
     bool vpmu_msr = false;
+    uint64_t tmp;
     int ret;
 
     if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
@@ -986,6 +987,12 @@  static int read_msr(unsigned int reg, uint64_t *val,
         }
         /* fall through */
     default:
+        if ( currd->arch.msr_legacy_handling && !rdmsr_safe(reg, tmp) )
+        {
+            *val = 0;
+            return X86EMUL_OKAY;
+        }
+
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
         break;
 
@@ -1148,6 +1155,9 @@  static int write_msr(unsigned int reg, uint64_t val,
         }
         /* fall through */
     default:
+        if ( currd->arch.msr_legacy_handling && !rdmsr_safe(reg, val) )
+            return X86EMUL_OKAY;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  reg, val);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 3900d7b48b..213557fc2c 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -437,6 +437,9 @@  struct arch_domain
     /* Mem_access emulation control */
     bool_t mem_access_emulate_each_rep;
 
+    /* Legacy handling of MSR accesses. */
+    bool msr_legacy_handling;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index 629cb2ba40..0a6b40bb89 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -304,6 +304,9 @@  struct xen_arch_domainconfig {
                                      XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
                                      XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
+
+#define XEN_X86_LEGACY_MSR_HANDLING (1u << 0)
+    uint32_t domain_flags;
 };
 
 /* Location of online VCPU bitmap. */