diff mbox series

x86/amd: do not expose HWCR.TscFreqSel to guests

Message ID 20230912162305.34339-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series x86/amd: do not expose HWCR.TscFreqSel to guests | expand

Commit Message

Roger Pau Monné Sept. 12, 2023, 4:23 p.m. UTC
OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
set, and will also attempt to unconditionally access HWCR if the TSC is
reported as Invariant.

The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
(bogus) warning message, but doing so at the cost of OpenBSD not booting is not
a suitable solution.

In order to fix expose an empty HWCR.

Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Not sure whether we want to expose something when is_cpufreq_controller() is
true, seeing as there's a special wrmsr handler for the same MSR in that case.
Likely should be done for PV only, but also likely quite bogus.

Missing reported by as the issue came from the QubesOS tracker.
---
 xen/arch/x86/msr.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Andrew Cooper Sept. 12, 2023, 4:35 p.m. UTC | #1
On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> set, and will also attempt to unconditionally access HWCR if the TSC is
> reported as Invariant.
>
> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> a suitable solution.
>
> In order to fix expose an empty HWCR.

At first I was thinking a straight up revert, but AMD's CPUID Faulting
is an architectural bit in here so it's worth keeping the register around.

>
> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Not sure whether we want to expose something when is_cpufreq_controller() is
> true, seeing as there's a special wrmsr handler for the same MSR in that case.
> Likely should be done for PV only, but also likely quite bogus.
>
> Missing reported by as the issue came from the QubesOS tracker.

Well - we can at least have a:

Link: https://github.com/QubesOS/qubes-issues/issues/8502

in the commit message, and it's probably worth asking Solène / Marek
(both CC'd) if they want a Reported-by tag.

> ---
>  xen/arch/x86/msr.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
> index 3f0450259cdf..964d500ff8a1 100644
> --- a/xen/arch/x86/msr.c
> +++ b/xen/arch/x86/msr.c
> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>      case MSR_K8_HWCR:
>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>              goto gp_fault;
> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
> +        /*
> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
> +         * also poke at PSTATE0.
> +         */

While this is true, the justification for removing this is because
TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.

Also because it's addition without writing into the migration stream was
bogus irrespective of the specifics of the bit.

I'm still of the opinion that it's buggy for OpenBSD to be looking at
model specific bits when virtualised, but given my latest reading of the
AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
can see TSC_FREQ_SEL.

In some theoretical future where the toolstack better understands MSRs
and (non)migratable VMs (which is the QubesOS usecase), then it would in
principle be fine to construct a VM which can see the host TSC_FREQ_SEL
and PSTATE* values.

Preferably with an adjusted comment, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>

~Andrew
Andrew Cooper Sept. 12, 2023, 4:36 p.m. UTC | #2
On 12/09/2023 5:35 pm, Andrew Cooper wrote:
> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
>> set, and will also attempt to unconditionally access HWCR if the TSC is
>> reported as Invariant.
>>
>> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
>> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
>> a suitable solution.
>>
>> In order to fix expose an empty HWCR.
> At first I was thinking a straight up revert, but AMD's CPUID Faulting
> is an architectural bit in here so it's worth keeping the register around.
>
>> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> Not sure whether we want to expose something when is_cpufreq_controller() is
>> true, seeing as there's a special wrmsr handler for the same MSR in that case.
>> Likely should be done for PV only, but also likely quite bogus.
>>
>> Missing reported by as the issue came from the QubesOS tracker.
> Well - we can at least have a:
>
> Link: https://github.com/QubesOS/qubes-issues/issues/8502
>
> in the commit message, and it's probably worth asking Solène / Marek
> (both CC'd) if they want a Reported-by tag.
>
>> ---
>>  xen/arch/x86/msr.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
>> index 3f0450259cdf..964d500ff8a1 100644
>> --- a/xen/arch/x86/msr.c
>> +++ b/xen/arch/x86/msr.c
>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>      case MSR_K8_HWCR:
>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>              goto gp_fault;
>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>> +        /*
>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>> +         * also poke at PSTATE0.
>> +         */
> While this is true, the justification for removing this is because
> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
>
> Also because it's addition without writing into the migration stream was
> bogus irrespective of the specifics of the bit.
>
> I'm still of the opinion that it's buggy for OpenBSD to be looking at
> model specific bits when virtualised, but given my latest reading of the
> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
> can see TSC_FREQ_SEL.
>
> In some theoretical future where the toolstack better understands MSRs
> and (non)migratable VMs (which is the QubesOS usecase), then it would in
> principle be fine to construct a VM which can see the host TSC_FREQ_SEL
> and PSTATE* values.
>
> Preferably with an adjusted comment, Reviewed-by: Andrew Cooper
> <andrew.cooper3@citrix.com>

Sorry - I meant to be clearer here.  I'd suggest just deleting the
comment and leaving an unconditional return of 0 (which will become
conditional when we wire up CPUID Faulting).

MSR_HWCR *is* an architectural MSR on any 64bit AMD system, so shouldn't
fault.

~Andrew
Jan Beulich Sept. 13, 2023, 6:14 a.m. UTC | #3
On 12.09.2023 18:23, Roger Pau Monne wrote:
> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> set, and will also attempt to unconditionally access HWCR if the TSC is
> reported as Invariant.
> 
> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> a suitable solution.

Why is the warning bogus? The PPR doesn't even state what the bit being
clear means; it's a r/o one. On respective families it cannot possibly
be correct to expose it clear.

> In order to fix expose an empty HWCR.
> 
> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Not sure whether we want to expose something when is_cpufreq_controller() is
> true, seeing as there's a special wrmsr handler for the same MSR in that case.
> Likely should be done for PV only, but also likely quite bogus.

Well, keying to is_cpufreq_controller() is indeed questionable, but is
there any reason to not minimally expose the bit correctly when a
domain cannot migrate?

Jan
Jan Beulich Sept. 13, 2023, 6:18 a.m. UTC | #4
On 12.09.2023 18:35, Andrew Cooper wrote:
> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>> --- a/xen/arch/x86/msr.c
>> +++ b/xen/arch/x86/msr.c
>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>      case MSR_K8_HWCR:
>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>              goto gp_fault;
>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>> +        /*
>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>> +         * also poke at PSTATE0.
>> +         */
> 
> While this is true, the justification for removing this is because
> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
> 
> Also because it's addition without writing into the migration stream was
> bogus irrespective of the specifics of the bit.
> 
> I'm still of the opinion that it's buggy for OpenBSD to be looking at
> model specific bits when virtualised, but given my latest reading of the
> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
> can see TSC_FREQ_SEL.

I'm afraid I'm still at a loss seeing what wording in which PPR makes you
see a connection. If there was a connection, I'd like to ask that we at
least properly consider exposing PSTATE0 (or if necessary all PSTATEn) as
well, with zero value (i.e. in particular with the valid bit clear),
rather than exposing a r/o bit with the wrong value, upsetting Linux.

Jan
Roger Pau Monné Sept. 13, 2023, 7:50 a.m. UTC | #5
On Tue, Sep 12, 2023 at 05:35:15PM +0100, Andrew Cooper wrote:
> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
> > OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> > set, and will also attempt to unconditionally access HWCR if the TSC is
> > reported as Invariant.
> >
> > The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> > (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> > a suitable solution.
> >
> > In order to fix expose an empty HWCR.
> 
> At first I was thinking a straight up revert, but AMD's CPUID Faulting
> is an architectural bit in here so it's worth keeping the register around.

A straight up revert won't work, because (as you notice below)
HWCR is architectural, so accesses must not fault.

> >
> > Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Not sure whether we want to expose something when is_cpufreq_controller() is
> > true, seeing as there's a special wrmsr handler for the same MSR in that case.
> > Likely should be done for PV only, but also likely quite bogus.
> >
> > Missing reported by as the issue came from the QubesOS tracker.
> 
> Well - we can at least have a:
> 
> Link: https://github.com/QubesOS/qubes-issues/issues/8502

Sure.

> in the commit message, and it's probably worth asking Solène / Marek
> (both CC'd) if they want a Reported-by tag.

I'm happy to add a Reported-by tag, just didn't have an email to use.

> > ---
> >  xen/arch/x86/msr.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
> > index 3f0450259cdf..964d500ff8a1 100644
> > --- a/xen/arch/x86/msr.c
> > +++ b/xen/arch/x86/msr.c
> > @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
> >      case MSR_K8_HWCR:
> >          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
> >              goto gp_fault;
> > -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
> > -               ? K8_HWCR_TSC_FREQ_SEL : 0;
> > +        /*
> > +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
> > +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
> > +         * also poke at PSTATE0.
> > +         */
> 
> While this is true, the justification for removing this is because
> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
> 
> Also because it's addition without writing into the migration stream was
> bogus irrespective of the specifics of the bit.
> 
> I'm still of the opinion that it's buggy for OpenBSD to be looking at
> model specific bits when virtualised,

Well, I think we can argue that an OS is free to ignore the CPUID HV
bit and still boot on Xen (even if that leads to non-ideal decisions).

> but given my latest reading of the
> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
> can see TSC_FREQ_SEL.

Hm, there's no written down note that TSC_FREQ_SEL implies PSTATE0 to
be available (and PSTATE0 is not an architectural MSR), but I can see
how a guest can expect to fetch the P0 frequency if it sees
TSC_FREQ_SEL.  It might have been more fail safe to check for
PSTATE_LIMIT not faulting before attempting to access PSTATE0.

> In some theoretical future where the toolstack better understands MSRs
> and (non)migratable VMs (which is the QubesOS usecase), then it would in
> principle be fine to construct a VM which can see the host TSC_FREQ_SEL
> and PSTATE* values.
> 
> Preferably with an adjusted comment, Reviewed-by: Andrew Cooper
> <andrew.cooper3@citrix.com>

Thanks, will reply to other comments before taking the RB and
resending.

Roger.
Roger Pau Monné Sept. 13, 2023, 8:08 a.m. UTC | #6
On Tue, Sep 12, 2023 at 05:36:53PM +0100, Andrew Cooper wrote:
> On 12/09/2023 5:35 pm, Andrew Cooper wrote:
> > On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
> >> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> >> set, and will also attempt to unconditionally access HWCR if the TSC is
> >> reported as Invariant.
> >>
> >> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> >> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> >> a suitable solution.
> >>
> >> In order to fix expose an empty HWCR.
> > At first I was thinking a straight up revert, but AMD's CPUID Faulting
> > is an architectural bit in here so it's worth keeping the register around.
> >
> >> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >> ---
> >> Not sure whether we want to expose something when is_cpufreq_controller() is
> >> true, seeing as there's a special wrmsr handler for the same MSR in that case.
> >> Likely should be done for PV only, but also likely quite bogus.
> >>
> >> Missing reported by as the issue came from the QubesOS tracker.
> > Well - we can at least have a:
> >
> > Link: https://github.com/QubesOS/qubes-issues/issues/8502
> >
> > in the commit message, and it's probably worth asking Solène / Marek
> > (both CC'd) if they want a Reported-by tag.
> >
> >> ---
> >>  xen/arch/x86/msr.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
> >> index 3f0450259cdf..964d500ff8a1 100644
> >> --- a/xen/arch/x86/msr.c
> >> +++ b/xen/arch/x86/msr.c
> >> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
> >>      case MSR_K8_HWCR:
> >>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
> >>              goto gp_fault;
> >> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
> >> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
> >> +        /*
> >> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
> >> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
> >> +         * also poke at PSTATE0.
> >> +         */
> > While this is true, the justification for removing this is because
> > TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
> >
> > Also because it's addition without writing into the migration stream was
> > bogus irrespective of the specifics of the bit.
> >
> > I'm still of the opinion that it's buggy for OpenBSD to be looking at
> > model specific bits when virtualised, but given my latest reading of the
> > AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
> > can see TSC_FREQ_SEL.
> >
> > In some theoretical future where the toolstack better understands MSRs
> > and (non)migratable VMs (which is the QubesOS usecase), then it would in
> > principle be fine to construct a VM which can see the host TSC_FREQ_SEL
> > and PSTATE* values.
> >
> > Preferably with an adjusted comment, Reviewed-by: Andrew Cooper
> > <andrew.cooper3@citrix.com>
> 
> Sorry - I meant to be clearer here.  I'd suggest just deleting the
> comment and leaving an unconditional return of 0 (which will become
> conditional when we wire up CPUID Faulting).
> 
> MSR_HWCR *is* an architectural MSR on any 64bit AMD system, so shouldn't
> fault.

Hm, I think it's worth to at least keep a note that if TSC_FREQ_SEL is
exposed PSTATE0 must also be exposed to prevent OpenBSD 7.3 from
panicking.

Thanks, Roger.
Roger Pau Monné Sept. 13, 2023, 8:20 a.m. UTC | #7
On Wed, Sep 13, 2023 at 08:14:46AM +0200, Jan Beulich wrote:
> On 12.09.2023 18:23, Roger Pau Monne wrote:
> > OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> > set, and will also attempt to unconditionally access HWCR if the TSC is
> > reported as Invariant.
> > 
> > The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> > (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> > a suitable solution.
> 
> Why is the warning bogus? The PPR doesn't even state what the bit being
> clear means; it's a r/o one. On respective families it cannot possibly
> be correct to expose it clear.

There are other bits in the same MSR that only state the meaning of
one value and are not r/o, so my take is that being clear means "The
TSC doesn't increment at the P0 frequency".

Since it's model specific, it should be possible for some models to
have the bit clear.

> > In order to fix expose an empty HWCR.
> > 
> > Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Not sure whether we want to expose something when is_cpufreq_controller() is
> > true, seeing as there's a special wrmsr handler for the same MSR in that case.
> > Likely should be done for PV only, but also likely quite bogus.
> 
> Well, keying to is_cpufreq_controller() is indeed questionable, but is
> there any reason to not minimally expose the bit correctly when a
> domain cannot migrate?

We would then also need to expose PSTATE0 at least to make OpenBSD
happy (and any otehr guest that makes the connection between
TscFreqSel and getting the P0 frequency from PSTATE0.

Thanks, Roger.
Jan Beulich Sept. 13, 2023, 8:35 a.m. UTC | #8
On 13.09.2023 10:20, Roger Pau Monné wrote:
> On Wed, Sep 13, 2023 at 08:14:46AM +0200, Jan Beulich wrote:
>> On 12.09.2023 18:23, Roger Pau Monne wrote:
>>> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
>>> set, and will also attempt to unconditionally access HWCR if the TSC is
>>> reported as Invariant.
>>>
>>> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
>>> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
>>> a suitable solution.
>>
>> Why is the warning bogus? The PPR doesn't even state what the bit being
>> clear means; it's a r/o one. On respective families it cannot possibly
>> be correct to expose it clear.
> 
> There are other bits in the same MSR that only state the meaning of
> one value and are not r/o, so my take is that being clear means "The
> TSC doesn't increment at the P0 frequency".
> 
> Since it's model specific, it should be possible for some models to
> have the bit clear.

The code that's in there right now already has a family >= 0x10 check.
The Fam10 BKDG says (among other things) "BIOS should program this bit
to 1." That, imo, doesn't leave much room for the bit being clear once
an OS (or hypervisor) gains control from firmware.

>>> In order to fix expose an empty HWCR.
>>>
>>> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> Not sure whether we want to expose something when is_cpufreq_controller() is
>>> true, seeing as there's a special wrmsr handler for the same MSR in that case.
>>> Likely should be done for PV only, but also likely quite bogus.
>>
>> Well, keying to is_cpufreq_controller() is indeed questionable, but is
>> there any reason to not minimally expose the bit correctly when a
>> domain cannot migrate?
> 
> We would then also need to expose PSTATE0 at least to make OpenBSD
> happy (and any otehr guest that makes the connection between
> TscFreqSel and getting the P0 frequency from PSTATE0.

If any such OSes can be used as Dom0, yes. And as said before, I view
exposing PSTATE0 (with zero value) as a viable alternative to your
partial revert anyway. What varies across families is how many PSTATEn
there are, but at least starting from Fam10 PSTATE0 looks to always be
there, with the top bit indicating validity of the other defined bits.

Jan
Roger Pau Monné Sept. 13, 2023, 8:50 a.m. UTC | #9
On Wed, Sep 13, 2023 at 08:18:57AM +0200, Jan Beulich wrote:
> On 12.09.2023 18:35, Andrew Cooper wrote:
> > On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
> >> --- a/xen/arch/x86/msr.c
> >> +++ b/xen/arch/x86/msr.c
> >> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
> >>      case MSR_K8_HWCR:
> >>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
> >>              goto gp_fault;
> >> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
> >> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
> >> +        /*
> >> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
> >> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
> >> +         * also poke at PSTATE0.
> >> +         */
> > 
> > While this is true, the justification for removing this is because
> > TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
> > 
> > Also because it's addition without writing into the migration stream was
> > bogus irrespective of the specifics of the bit.
> > 
> > I'm still of the opinion that it's buggy for OpenBSD to be looking at
> > model specific bits when virtualised, but given my latest reading of the
> > AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
> > can see TSC_FREQ_SEL.
> 
> I'm afraid I'm still at a loss seeing what wording in which PPR makes you
> see a connection. If there was a connection, I'd like to ask that we at
> least properly consider exposing PSTATE0 (or if necessary all PSTATEn) as
> well, with zero value (i.e. in particular with the valid bit clear),
> rather than exposing a r/o bit with the wrong value, upsetting Linux.

But PSTATEn is also non-architectural, so the bit meaning could in
theory change between models.

There seems to be no bit anywhere that signals whether PSTATEn is
available, as it's my reading that you can have PSTATEn without the
architectural PSTATE_{LIMIT,CTRL,STATUS} MSRs that are signaled by
HW_PSTATE CPUID bit.

Thanks, Roger.
Roger Pau Monné Sept. 13, 2023, 9:16 a.m. UTC | #10
On Wed, Sep 13, 2023 at 10:35:17AM +0200, Jan Beulich wrote:
> On 13.09.2023 10:20, Roger Pau Monné wrote:
> > On Wed, Sep 13, 2023 at 08:14:46AM +0200, Jan Beulich wrote:
> >> On 12.09.2023 18:23, Roger Pau Monne wrote:
> >>> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
> >>> set, and will also attempt to unconditionally access HWCR if the TSC is
> >>> reported as Invariant.
> >>>
> >>> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
> >>> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
> >>> a suitable solution.
> >>
> >> Why is the warning bogus? The PPR doesn't even state what the bit being
> >> clear means; it's a r/o one. On respective families it cannot possibly
> >> be correct to expose it clear.
> > 
> > There are other bits in the same MSR that only state the meaning of
> > one value and are not r/o, so my take is that being clear means "The
> > TSC doesn't increment at the P0 frequency".
> > 
> > Since it's model specific, it should be possible for some models to
> > have the bit clear.
> 
> The code that's in there right now already has a family >= 0x10 check.
> The Fam10 BKDG says (among other things) "BIOS should program this bit
> to 1." That, imo, doesn't leave much room for the bit being clear once
> an OS (or hypervisor) gains control from firmware.
> 
> >>> In order to fix expose an empty HWCR.
> >>>
> >>> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>> ---
> >>> Not sure whether we want to expose something when is_cpufreq_controller() is
> >>> true, seeing as there's a special wrmsr handler for the same MSR in that case.
> >>> Likely should be done for PV only, but also likely quite bogus.
> >>
> >> Well, keying to is_cpufreq_controller() is indeed questionable, but is
> >> there any reason to not minimally expose the bit correctly when a
> >> domain cannot migrate?
> > 
> > We would then also need to expose PSTATE0 at least to make OpenBSD
> > happy (and any otehr guest that makes the connection between
> > TscFreqSel and getting the P0 frequency from PSTATE0.
> 
> If any such OSes can be used as Dom0, yes.

OpenBSD can't be used as dom0, but QubesOS does use the nomigrate flag
for domUs.

> And as said before, I view
> exposing PSTATE0 (with zero value) as a viable alternative to your
> partial revert anyway. What varies across families is how many PSTATEn
> there are, but at least starting from Fam10 PSTATE0 looks to always be
> there, with the top bit indicating validity of the other defined bits.

I did consider this, but it seemed to just dig us deeper into exposing
non-architectural MSRs, which in the long run is more likely to be
troublesome if newer models change the meaning of bits in PSTATEn.

Thanks, Roger.
Jan Beulich Sept. 13, 2023, 9:40 a.m. UTC | #11
On 13.09.2023 11:16, Roger Pau Monné wrote:
> On Wed, Sep 13, 2023 at 10:35:17AM +0200, Jan Beulich wrote:
>> On 13.09.2023 10:20, Roger Pau Monné wrote:
>>> On Wed, Sep 13, 2023 at 08:14:46AM +0200, Jan Beulich wrote:
>>>> On 12.09.2023 18:23, Roger Pau Monne wrote:
>>>>> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
>>>>> set, and will also attempt to unconditionally access HWCR if the TSC is
>>>>> reported as Invariant.
>>>>>
>>>>> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
>>>>> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
>>>>> a suitable solution.
>>>>
>>>> Why is the warning bogus? The PPR doesn't even state what the bit being
>>>> clear means; it's a r/o one. On respective families it cannot possibly
>>>> be correct to expose it clear.
>>>
>>> There are other bits in the same MSR that only state the meaning of
>>> one value and are not r/o, so my take is that being clear means "The
>>> TSC doesn't increment at the P0 frequency".
>>>
>>> Since it's model specific, it should be possible for some models to
>>> have the bit clear.
>>
>> The code that's in there right now already has a family >= 0x10 check.
>> The Fam10 BKDG says (among other things) "BIOS should program this bit
>> to 1." That, imo, doesn't leave much room for the bit being clear once
>> an OS (or hypervisor) gains control from firmware.
>>
>>>>> In order to fix expose an empty HWCR.
>>>>>
>>>>> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
>>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>>> ---
>>>>> Not sure whether we want to expose something when is_cpufreq_controller() is
>>>>> true, seeing as there's a special wrmsr handler for the same MSR in that case.
>>>>> Likely should be done for PV only, but also likely quite bogus.
>>>>
>>>> Well, keying to is_cpufreq_controller() is indeed questionable, but is
>>>> there any reason to not minimally expose the bit correctly when a
>>>> domain cannot migrate?
>>>
>>> We would then also need to expose PSTATE0 at least to make OpenBSD
>>> happy (and any otehr guest that makes the connection between
>>> TscFreqSel and getting the P0 frequency from PSTATE0.
>>
>> If any such OSes can be used as Dom0, yes.
> 
> OpenBSD can't be used as dom0, but QubesOS does use the nomigrate flag
> for domUs.
> 
>> And as said before, I view
>> exposing PSTATE0 (with zero value) as a viable alternative to your
>> partial revert anyway. What varies across families is how many PSTATEn
>> there are, but at least starting from Fam10 PSTATE0 looks to always be
>> there, with the top bit indicating validity of the other defined bits.
> 
> I did consider this, but it seemed to just dig us deeper into exposing
> non-architectural MSRs, which in the long run is more likely to be
> troublesome if newer models change the meaning of bits in PSTATEn.

Not sure about "deeper" - we're discussing to expose a non-architectural
bit in HWCR with the wrong value vs exposing a non-architectural MSR
where we'd rely on only the top bit retaining its meaning going forward.

Jan
Andrew Cooper Sept. 13, 2023, 10:11 a.m. UTC | #12
On 13/09/2023 9:08 am, Roger Pau Monné wrote:
> On Tue, Sep 12, 2023 at 05:36:53PM +0100, Andrew Cooper wrote:
>> On 12/09/2023 5:35 pm, Andrew Cooper wrote:
>>> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>>>> OpenBSD will attempt to unconditionally access PSTATE0 if HWCR.TscFreqSel is
>>>> set, and will also attempt to unconditionally access HWCR if the TSC is
>>>> reported as Invariant.
>>>>
>>>> The reasoning for exposing HWCR.TscFreqSel was to avoid Linux from printing a
>>>> (bogus) warning message, but doing so at the cost of OpenBSD not booting is not
>>>> a suitable solution.
>>>>
>>>> In order to fix expose an empty HWCR.
>>> At first I was thinking a straight up revert, but AMD's CPUID Faulting
>>> is an architectural bit in here so it's worth keeping the register around.
>>>
>>>> Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests')
>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>> ---
>>>> Not sure whether we want to expose something when is_cpufreq_controller() is
>>>> true, seeing as there's a special wrmsr handler for the same MSR in that case.
>>>> Likely should be done for PV only, but also likely quite bogus.
>>>>
>>>> Missing reported by as the issue came from the QubesOS tracker.
>>> Well - we can at least have a:
>>>
>>> Link: https://github.com/QubesOS/qubes-issues/issues/8502
>>>
>>> in the commit message, and it's probably worth asking Solène / Marek
>>> (both CC'd) if they want a Reported-by tag.
>>>
>>>> ---
>>>>  xen/arch/x86/msr.c | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
>>>> index 3f0450259cdf..964d500ff8a1 100644
>>>> --- a/xen/arch/x86/msr.c
>>>> +++ b/xen/arch/x86/msr.c
>>>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>>>      case MSR_K8_HWCR:
>>>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>>>              goto gp_fault;
>>>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>>>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>>>> +        /*
>>>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>>>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>>>> +         * also poke at PSTATE0.
>>>> +         */
>>> While this is true, the justification for removing this is because
>>> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
>>>
>>> Also because it's addition without writing into the migration stream was
>>> bogus irrespective of the specifics of the bit.
>>>
>>> I'm still of the opinion that it's buggy for OpenBSD to be looking at
>>> model specific bits when virtualised, but given my latest reading of the
>>> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
>>> can see TSC_FREQ_SEL.
>>>
>>> In some theoretical future where the toolstack better understands MSRs
>>> and (non)migratable VMs (which is the QubesOS usecase), then it would in
>>> principle be fine to construct a VM which can see the host TSC_FREQ_SEL
>>> and PSTATE* values.
>>>
>>> Preferably with an adjusted comment, Reviewed-by: Andrew Cooper
>>> <andrew.cooper3@citrix.com>
>> Sorry - I meant to be clearer here.  I'd suggest just deleting the
>> comment and leaving an unconditional return of 0 (which will become
>> conditional when we wire up CPUID Faulting).
>>
>> MSR_HWCR *is* an architectural MSR on any 64bit AMD system, so shouldn't
>> fault.
> Hm, I think it's worth to at least keep a note that if TSC_FREQ_SEL is
> exposed PSTATE0 must also be exposed to prevent OpenBSD 7.3 from
> panicking.

But there's nothing OpenBSD 7.3 specific about it.

Any software which sees this bit is permitted (expected even) to edit
the pstate registers.


You know why it's called frequency select?  Because firmware is supposed
to program the systemwide frequency/voltage to the user's request for
whether the system wants to run cooler, or be overclocked.


Putting OpenBSD 7.3 in the commit message is absolutely relevant to why
we're making this change now, but it's not relevant to why we have
concluded that TSC_FREQ_SEL is bogus to expose to guests.

~Andrew
Andrew Cooper Sept. 13, 2023, 10:50 a.m. UTC | #13
On 13/09/2023 8:50 am, Roger Pau Monné wrote:
> On Tue, Sep 12, 2023 at 05:35:15PM +0100, Andrew Cooper wrote:
>> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>>> ---
>>>  xen/arch/x86/msr.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
>>> index 3f0450259cdf..964d500ff8a1 100644
>>> --- a/xen/arch/x86/msr.c
>>> +++ b/xen/arch/x86/msr.c
>>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>>      case MSR_K8_HWCR:
>>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>>              goto gp_fault;
>>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>>> +        /*
>>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>>> +         * also poke at PSTATE0.
>>> +         */
>> While this is true, the justification for removing this is because
>> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
>>
>> Also because it's addition without writing into the migration stream was
>> bogus irrespective of the specifics of the bit.
>>
>> I'm still of the opinion that it's buggy for OpenBSD to be looking at
>> model specific bits when virtualised,
> Well, I think we can argue that an OS is free to ignore the CPUID HV
> bit and still boot on Xen (even if that leads to non-ideal decisions).

An OS which keeps itself to architectural details that we do our very
best to be correct with, ought to function even if it ignores the HV bit.

But we're (deliberately) not doing model-specific-accurate emulations of
a specific platform.  An OS which ignores details about the environment
it is operating in gets to keep the faults/malfunctions when it does
something illegal.

>> but given my latest reading of the
>> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
>> can see TSC_FREQ_SEL.
> Hm, there's no written down note that TSC_FREQ_SEL implies PSTATE0 to
> be available (and PSTATE0 is not an architectural MSR), but I can see
> how a guest can expect to fetch the P0 frequency if it sees
> TSC_FREQ_SEL.

The PPR is a reference of mostly autogenerated details and misc notes,
put together in a non- hand-write way, unlike the older BKWGs.

Lots of the information elided from public and partner-NDA versions is
"see TICKET/LINK for rational" type comments.

It is not a spec - it is a reference (the clue is even in the name)
aimed at people already familiar with the area.  Do not fall into the
trap of thinking it it can be read as a spec.

~Andrew
Andrew Cooper Sept. 13, 2023, 11:02 a.m. UTC | #14
On 13/09/2023 7:18 am, Jan Beulich wrote:
> On 12.09.2023 18:35, Andrew Cooper wrote:
>> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/msr.c
>>> +++ b/xen/arch/x86/msr.c
>>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>>      case MSR_K8_HWCR:
>>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>>              goto gp_fault;
>>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>>> +        /*
>>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>>> +         * also poke at PSTATE0.
>>> +         */
>> While this is true, the justification for removing this is because
>> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
>>
>> Also because it's addition without writing into the migration stream was
>> bogus irrespective of the specifics of the bit.
>>
>> I'm still of the opinion that it's buggy for OpenBSD to be looking at
>> model specific bits when virtualised, but given my latest reading of the
>> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
>> can see TSC_FREQ_SEL.
> I'm afraid I'm still at a loss seeing what wording in which PPR makes you
> see a connection. If there was a connection, I'd like to ask that we at
> least properly consider exposing PSTATE0 (or if necessary all PSTATEn) as
> well, with zero value (i.e. in particular with the valid bit clear),
> rather than exposing a r/o bit with the wrong value, upsetting Linux.

This mess is caused by the erroneous advertising of a model specific
bit, and there's no way in hell that giving the guest even more model
specific data is going to fix it.

The PSTATE MSRs are entirely model specific, fully read/write, and the
Enable bit is not an enable bit; its a "not valid yet" bit that firmware
is required to adjust to be consistent across the coherency fabric.

Linux is simply wrong with it's printk() under virt, and wants adjusting.

~Andrew
Jan Beulich Sept. 13, 2023, 1:27 p.m. UTC | #15
On 13.09.2023 12:50, Andrew Cooper wrote:
> On 13/09/2023 8:50 am, Roger Pau Monné wrote:
>> Hm, there's no written down note that TSC_FREQ_SEL implies PSTATE0 to
>> be available (and PSTATE0 is not an architectural MSR), but I can see
>> how a guest can expect to fetch the P0 frequency if it sees
>> TSC_FREQ_SEL.
> 
> The PPR is a reference of mostly autogenerated details and misc notes,
> put together in a non- hand-write way, unlike the older BKWGs.
> 
> Lots of the information elided from public and partner-NDA versions is
> "see TICKET/LINK for rational" type comments.
> 
> It is not a spec - it is a reference (the clue is even in the name)
> aimed at people already familiar with the area.  Do not fall into the
> trap of thinking it it can be read as a spec.

But then where is it written down that the bit set implies the PSTATEn
MSRs to exist?

Jan
Jan Beulich Sept. 13, 2023, 1:37 p.m. UTC | #16
On 13.09.2023 13:02, Andrew Cooper wrote:
> On 13/09/2023 7:18 am, Jan Beulich wrote:
>> On 12.09.2023 18:35, Andrew Cooper wrote:
>>> On 12/09/2023 5:23 pm, Roger Pau Monne wrote:
>>>> --- a/xen/arch/x86/msr.c
>>>> +++ b/xen/arch/x86/msr.c
>>>> @@ -240,8 +240,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
>>>>      case MSR_K8_HWCR:
>>>>          if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
>>>>              goto gp_fault;
>>>> -        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
>>>> -               ? K8_HWCR_TSC_FREQ_SEL : 0;
>>>> +        /*
>>>> +         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
>>>> +         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
>>>> +         * also poke at PSTATE0.
>>>> +         */
>>> While this is true, the justification for removing this is because
>>> TSC_FREQ_SEL is a model specific bit, not an architectural bit in HWCR.
>>>
>>> Also because it's addition without writing into the migration stream was
>>> bogus irrespective of the specifics of the bit.
>>>
>>> I'm still of the opinion that it's buggy for OpenBSD to be looking at
>>> model specific bits when virtualised, but given my latest reading of the
>>> AMD manuals, I think OpenBSD *is* well behaved looking at PSTATE0 if it
>>> can see TSC_FREQ_SEL.
>> I'm afraid I'm still at a loss seeing what wording in which PPR makes you
>> see a connection. If there was a connection, I'd like to ask that we at
>> least properly consider exposing PSTATE0 (or if necessary all PSTATEn) as
>> well, with zero value (i.e. in particular with the valid bit clear),
>> rather than exposing a r/o bit with the wrong value, upsetting Linux.
> 
> This mess is caused by the erroneous advertising of a model specific
> bit, and there's no way in hell that giving the guest even more model
> specific data is going to fix it.
> 
> The PSTATE MSRs are entirely model specific, fully read/write, and the
> Enable bit is not an enable bit; its a "not valid yet" bit that firmware
> is required to adjust to be consistent across the coherency fabric.
> 
> Linux is simply wrong with it's printk() under virt, and wants adjusting.

Assuming Roger agrees with this statement, then I think this is another
item wanting mentioning in the description. I then still wouldn't ack
the change, but I also wouldn't object to it anymore.

Jan
Thomas Gleixner Sept. 13, 2023, 8:31 p.m. UTC | #17
On Wed, Sep 13 2023 at 12:02, Andrew Cooper wrote:
> The PSTATE MSRs are entirely model specific, fully read/write, and the
> Enable bit is not an enable bit; its a "not valid yet" bit that firmware
> is required to adjust to be consistent across the coherency fabric.
>
> Linux is simply wrong with it's printk() under virt, and wants adjusting.

No objections from my side.

Thanks,

        tglx
diff mbox series

Patch

diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index 3f0450259cdf..964d500ff8a1 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -240,8 +240,12 @@  int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
     case MSR_K8_HWCR:
         if ( !(cp->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) )
             goto gp_fault;
-        *val = get_cpu_family(cp->basic.raw_fms, NULL, NULL) >= 0x10
-               ? K8_HWCR_TSC_FREQ_SEL : 0;
+        /*
+         * OpenBSD 7.3 accesses HWCR unconditionally if the TSC is reported as
+         * Invariant.  Do not set TSC_FREQ_SEL as that would trigger OpenBSD to
+         * also poke at PSTATE0.
+         */
+        *val = 0;
         break;
 
     case MSR_VIRT_SPEC_CTRL: