[RFC] x86/amd: Avoid cpu_has_hypervisor evaluating true on native hardware

Message ID 20200127202121.2961-1-andrew.cooper3@citrix.com (mailing list archive)
State Superseded
Series [RFC] x86/amd: Avoid cpu_has_hypervisor evaluating true on native hardware

Commit Message

Andrew Cooper Jan. 27, 2020, 8:21 p.m. UTC
Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
configured with the HYPERVISOR bit before native CPUID is scanned for feature
bits.

This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
ends up appearing in the raw and host CPU policies.  Nothing has really cared
in the past.

Alter amd_init_levelling() to exclude the HYPERVISOR bit from
cpuidmask_defaults, and update domain_cpu_policy_changed() to allow it to be
explicitly forwarded.

This in turn highlighted that dom0 construction was asymmetric with domU
construction, by not having any calls to domain_cpu_policy_changed().  Extend
arch_domain_create() to always call domain_cpu_policy_changed().

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Igor Druzhinin <igor.druzhinin@citrix.com>

Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
I haven't investigated the issue with that patch specifically, because
cpu_has_hypervisor being wrong is obviously a bug.

This is one of two possible approaches, and both have their downsides.  This
one takes an extra hit on context switches between PV vcpus and idle/hvm, as
they will usually differ in HYPERVISOR bit.

The other approach is to order things more carefully so levelling is
configured after scanning for cpuid bits, but that has the downside that it is
very easy to regress.

Thoughts on which is the least-bad approach to take?  Having written this
patch, I'm now erring on the side of doing it the other way.
---
 xen/arch/x86/cpu/amd.c       | 3 ---
 xen/arch/x86/domain.c        | 2 ++
 xen/arch/x86/domctl.c        | 9 ++++++++-
 xen/include/asm-x86/domain.h | 2 ++
 4 files changed, 12 insertions(+), 4 deletions(-)

Comments

Roger Pau Monné Jan. 28, 2020, 10:39 a.m. UTC | #1
On Mon, Jan 27, 2020 at 08:21:21PM +0000, Andrew Cooper wrote:
> Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
> configured with the HYPERVISOR bit before native CPUID is scanned for feature
> bits.
> 
> This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
> ends up appearing in the raw and host CPU policies.  Nothing has really cared
> in the past.
> 
> Alter amd_init_levelling() to exclude the HYPERVISOR bit from
> cpuidmask_defaults, and update domain_cpu_policy_changed() to allow it to be
> explicitly forwarded.
> 
> This in turn highlighted that dom0 construction was asymmetric with domU
> construction, by not having any calls to domain_cpu_policy_changed().  Extend
> arch_domain_create() to always call domain_cpu_policy_changed().
> 
> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Igor Druzhinin <igor.druzhinin@citrix.com>
> 
> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
> I haven't investigated the issue with that patch specifically, because
> cpu_has_hypervisor being wrong is obviously a bug.

I've tested the series on one AMD box and it worked for me.  Even if
cpu_has_hypervisor is set on real hardware, the added call to
hypervisor_flush_tlb should be fine, as ops is NULL in that case and
the call would just be a dummy (this obviously needs to be fixed so
that cpu_has_hypervisor isn't true when not running virtualized).
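
For illustration, the pattern is roughly as follows (a sketch only; the
function signature here is assumed rather than copied from the series):

    /*
     * Sketch of the guarded dispatch: with no hypervisor detected, ops
     * stays NULL and the assisted flush reports "not supported", so the
     * caller falls back to the native IPI-based flush.
     */
    int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
                             unsigned int order)
    {
        if ( ops && ops->flush_tlb )
            return ops->flush_tlb(mask, va, order);

        return -EOPNOTSUPP;
    }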

> 
> This is one of two possible approaches, and both have their downsides.  This
> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
> they will usually differ in HYPERVISOR bit.
> 
> The other approach is to order things more carefully so levelling is
> configured after scanning for cpuid bits, but that has the downside that it is
> very easy to regress.
> 
> Thoughts on which is the least-bad approach to take?  Having written this
> patch, I'm now erring on the side of doing it the other way.
> ---
>  xen/arch/x86/cpu/amd.c       | 3 ---
>  xen/arch/x86/domain.c        | 2 ++
>  xen/arch/x86/domctl.c        | 9 ++++++++-
>  xen/include/asm-x86/domain.h | 2 ++
>  4 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 8b5f0f2e4c..0906b23582 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -297,9 +297,6 @@ static void __init noinline amd_init_levelling(void)
>  			ecx |= cpufeat_mask(X86_FEATURE_OSXSAVE);
>  		edx |= cpufeat_mask(X86_FEATURE_APIC);
>  
> -		/* Allow the HYPERVISOR bit to be set via guest policy. */
> -		ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);

We also seem to force X86_FEATURE_APIC into the policy, which seems
wrong?

I guess all AMD hardware Xen boots on has the APIC feature, so this
isn't a real issue, but still seems quite weird.

>  		cpuidmask_defaults._1cd = ((uint64_t)ecx << 32) | edx;
>  	}
>  
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 28fefa1f81..316b801597 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -665,6 +665,8 @@ int arch_domain_create(struct domain *d,
>       */
>      d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8;
>  
> +    domain_cpu_policy_changed(d);
> +
>      return 0;
>  
>   fail:
> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 5ed63ac10a..0627eb4e06 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -48,7 +48,7 @@ static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
>  }
>  #endif
>  
> -static void domain_cpu_policy_changed(struct domain *d)
> +void domain_cpu_policy_changed(struct domain *d)
>  {
>      const struct cpuid_policy *p = d->arch.cpuid;
>      struct vcpu *v;
> @@ -106,6 +106,13 @@ static void domain_cpu_policy_changed(struct domain *d)
>                      ecx = 0;
>                  edx = cpufeat_mask(X86_FEATURE_APIC);
>  
> +                /*
> +                 * If the Hypervisor bit is set in the policy, we can also
> +                 * forward it into real CPUID.
> +                 */
> +                if ( p->basic.hypervisor )
> +                    ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);

Since the hypervisor bit will be part of both the HVM and PV max
policies, why do you need to explicitly allow it here?

Won't it be naturally added to the guest policy as the rest of
features?

Thanks, Roger.
Andrew Cooper Jan. 28, 2020, 11:21 a.m. UTC | #2
On 28/01/2020 10:39, Roger Pau Monné wrote:
>
>> This is one of two possible approaches, and both have their downsides.  This
>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>> they will usually differ in HYPERVISOR bit.
>>
>> The other approach is to order things more carefully so levelling is
>> configured after scanning for cpuid bits, but that has the downside that it is
>> very easy to regress.
>>
>> Thoughts on which is the least-bad approach to take?  Having written this
>> patch, I'm now erring on the side of doing it the other way.
>> ---
>>  xen/arch/x86/cpu/amd.c       | 3 ---
>>  xen/arch/x86/domain.c        | 2 ++
>>  xen/arch/x86/domctl.c        | 9 ++++++++-
>>  xen/include/asm-x86/domain.h | 2 ++
>>  4 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
>> index 8b5f0f2e4c..0906b23582 100644
>> --- a/xen/arch/x86/cpu/amd.c
>> +++ b/xen/arch/x86/cpu/amd.c
>> @@ -297,9 +297,6 @@ static void __init noinline amd_init_levelling(void)
>>  			ecx |= cpufeat_mask(X86_FEATURE_OSXSAVE);
>>  		edx |= cpufeat_mask(X86_FEATURE_APIC);
>>  
>> -		/* Allow the HYPERVISOR bit to be set via guest policy. */
>> -		ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
> We also seem to force X86_FEATURE_APIC into the policy, which seems
> wrong?
>
> I guess all AMD hardware Xen boots on has the APIC feature, so this
> isn't a real issue, but still seems quite weird.

The comment just outside the quoted context explains why.

The APIC bit is special (fast-forwarded from MSR_APIC_BASE.EN), and for
the fast-forwarding to work correctly, the bit needs to remain
unconditionally set in the "mask" MSR.
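
As a model of the interaction (an illustrative sketch, not hardware
documentation):

    /*
     * Sketch: the fast-forwarded APIC value only shows through if the
     * bit is also set in the levelling MSR, which is why
     * amd_init_levelling() forces it on unconditionally.
     */
    static uint32_t visible_leaf1_edx(uint32_t levelled_edx, bool apic_base_en)
    {
        uint32_t edx = levelled_edx & ~cpufeat_mask(X86_FEATURE_APIC);

        if ( apic_base_en && (levelled_edx & cpufeat_mask(X86_FEATURE_APIC)) )
            edx |= cpufeat_mask(X86_FEATURE_APIC);    /* fast-forwarded */

        return edx;
    }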

>
>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>> index 5ed63ac10a..0627eb4e06 100644
>> --- a/xen/arch/x86/domctl.c
>> +++ b/xen/arch/x86/domctl.c
>> @@ -48,7 +48,7 @@ static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
>>  }
>>  #endif
>>  
>> -static void domain_cpu_policy_changed(struct domain *d)
>> +void domain_cpu_policy_changed(struct domain *d)
>>  {
>>      const struct cpuid_policy *p = d->arch.cpuid;
>>      struct vcpu *v;
>> @@ -106,6 +106,13 @@ static void domain_cpu_policy_changed(struct domain *d)
>>                      ecx = 0;
>>                  edx = cpufeat_mask(X86_FEATURE_APIC);
>>  
>> +                /*
>> +                 * If the Hypervisor bit is set in the policy, we can also
>> +                 * forward it into real CPUID.
>> +                 */
>> +                if ( p->basic.hypervisor )
>> +                    ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
> Since the hypervisor bit will be part of both the HVM and PV max
> policies, why do you need to explicitly allow it here?
>
> Won't it be naturally added to the guest policy as the rest of
> features?

cpuidmask_defaults serves as an and-mask over any content the guest
policy selects.

This is because, in the AMD case, these are actually override MSRs
rather than mask MSRs.  Care has to be taken not to advertise any
features the pipeline can't deliver on, but it is also the property
which allows us to advertise the HYPERVISOR bit in general.

Previously, HYPERVISOR was within the AND mask, so if the guest policy
requested it (which it will by default, but can be turned off with
sufficient cpuid= configuration), it would get included.  With this
change, HYPERVISOR is now clear in the mask, so it needs explicitly
adding back in.
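
Concretely, the AMD leaf 1 flow in domain_cpu_policy_changed() after
this patch is roughly (a simplified sketch of the hunk below):

    uint64_t mask = cpuidmask_defaults._1cd;
    uint32_t ecx = p->basic._1c, edx = p->basic._1d;

    mask &= ((uint64_t)ecx << 32) | edx;       /* policy acts as AND mask */

    ecx = 0;
    edx = cpufeat_mask(X86_FEATURE_APIC);      /* fast-forwarded: keep set */
    if ( p->basic.hypervisor )                 /* new in this patch */
        ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);

    mask |= ((uint64_t)ecx << 32) | edx;       /* force the special bits */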

~Andrew
Roger Pau Monné Jan. 28, 2020, 11:58 a.m. UTC | #3
On Tue, Jan 28, 2020 at 11:21:14AM +0000, Andrew Cooper wrote:
> On 28/01/2020 10:39, Roger Pau Monné wrote:
> >
> >> This is one of two possible approaches, and both have their downsides.  This
> >> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
> >> they will usually differ in HYPERVISOR bit.
> >>
> >> The other approach is to order things more carefully so levelling is
> >> configured after scanning for cpuid bits, but that has the downside that it is
> >> very easy to regress.

Would it be enough to populate boot_cpu_data before setting
cpuidmask_defaults?

So that at least the data in boot_cpu_data is not tainted with the
bits in cpuidmask_defaults?

I'm certainly not an expert on that area, so your judgment on the
least bad approach is likely more accurate than mine.

> >> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> >> index 5ed63ac10a..0627eb4e06 100644
> >> --- a/xen/arch/x86/domctl.c
> >> +++ b/xen/arch/x86/domctl.c
> >> @@ -48,7 +48,7 @@ static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
> >>  }
> >>  #endif
> >>  
> >> -static void domain_cpu_policy_changed(struct domain *d)
> >> +void domain_cpu_policy_changed(struct domain *d)
> >>  {
> >>      const struct cpuid_policy *p = d->arch.cpuid;
> >>      struct vcpu *v;
> >> @@ -106,6 +106,13 @@ static void domain_cpu_policy_changed(struct domain *d)
> >>                      ecx = 0;
> >>                  edx = cpufeat_mask(X86_FEATURE_APIC);
> >>  
> >> +                /*
> >> +                 * If the Hypervisor bit is set in the policy, we can also
> >> +                 * forward it into real CPUID.
> >> +                 */
> >> +                if ( p->basic.hypervisor )
> >> +                    ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
> > Since the hypervisor bit will be part of both the HVM and PV max
> > policies, why do you need to explicitly allow it here?
> >
> > Won't it be naturally added to the guest policy as the rest of
> > features?
> 
> cpuidmask_defaults serves as an and-mask over any content the guest
> policy selects.
>
> This is because, in the AMD case, these are actually override MSRs
> rather than mask MSRs.  Care has to be taken not to advertise any
>> features the pipeline can't deliver on, but it is also the property
> which allows us to advertise the HYPERVISOR bit in general.

Oh, so on AMD cpuidmask_defaults is not a mask, it can also force
features to appear on cpuid, even when not present on the hardware
cpuid.

Would it make sense to AND cpuidmask_defaults with the real hardware
policy before writing it to the MSR in order to prevent things not
present on the hardware policy from appearing magically?

Thanks, Roger.
Andrew Cooper Jan. 28, 2020, 12:13 p.m. UTC | #4
On 28/01/2020 11:58, Roger Pau Monné wrote:
> On Tue, Jan 28, 2020 at 11:21:14AM +0000, Andrew Cooper wrote:
>> On 28/01/2020 10:39, Roger Pau Monné wrote:
>>>> This is one of two possible approaches, and both have their downsides.  This
>>>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>>>> they will usually differ in HYPERVISOR bit.
>>>>
>>>> The other approach is to order things more carefully so levelling is
>>>> configured after scanning for cpuid bits, but that has the downside that it is
>>>> very easy to regress.
> Would it be enough to populate boot_cpu_data before setting
> cpuidmask_defaults?
>
> So that at least the data in boot_cpu_data is not tainted with the
> bits in cpuidmask_defaults?

That was the discussion in the RFC section, where I also said I was
preferring that way of doing it anyway...

>
> I'm certainly not an expert on that area, so your judgment on the
> least bad approach is likely more accurate than mine.
>
>>>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>>>> index 5ed63ac10a..0627eb4e06 100644
>>>> --- a/xen/arch/x86/domctl.c
>>>> +++ b/xen/arch/x86/domctl.c
>>>> @@ -48,7 +48,7 @@ static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
>>>>  }
>>>>  #endif
>>>>  
>>>> -static void domain_cpu_policy_changed(struct domain *d)
>>>> +void domain_cpu_policy_changed(struct domain *d)
>>>>  {
>>>>      const struct cpuid_policy *p = d->arch.cpuid;
>>>>      struct vcpu *v;
>>>> @@ -106,6 +106,13 @@ static void domain_cpu_policy_changed(struct domain *d)
>>>>                      ecx = 0;
>>>>                  edx = cpufeat_mask(X86_FEATURE_APIC);
>>>>  
>>>> +                /*
>>>> +                 * If the Hypervisor bit is set in the policy, we can also
>>>> +                 * forward it into real CPUID.
>>>> +                 */
>>>> +                if ( p->basic.hypervisor )
>>>> +                    ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
>>> Since the hypervisor bit will be part of both the HVM and PV max
>>> policies, why do you need to explicitly allow it here?
>>>
>>> Won't it be naturally added to the guest policy as the rest of
>>> features?
>> cpuidmask_defaults serves as an and-mask over any content the guest
>> policy selects.
>>
>> This is because, in the AMD case, these are actually override MSRs
>> rather than mask MSRs.  Care has to be taken not to advertise any
>> features the pipeline can't deliver on, but it is also the property
>> which allows us to advertise the HYPERVISOR bit in general.
> Oh, so on AMD cpuidmask_defaults is not a mask, it can also force
> features to appear on cpuid, even when not present on the hardware
> cpuid.

Correct.

> Would it make sense to AND cpuidmask_defaults with the real hardware
> policy before writing it to the MSR in order to prevent things not
> present on the hardware policy from appearing magically?

cpuidmask_defaults are already real hardware values (give or take the
special bits, and possibly imitating an older CPU via a cpuid_mask_cpu=
setting).

The point of doing it like that is so we don't have to perform redundant
calculations on every context switch.
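
The lazy write looks roughly like this (a sketch modelled on
amd_ctxt_switch_levelling(), not the verbatim code):

    static DEFINE_PER_CPU(uint64_t, levelling_1cd);  /* last value written */

    static void lazy_set_1cd(uint64_t val)
    {
        if ( val == this_cpu(levelling_1cd) )
            return;                            /* skip the expensive WRMSR */

        wrmsr_amd_safe(MSR_K8_FEATURE_MASK, (uint32_t)val, val >> 32);
        this_cpu(levelling_1cd) = val;
    }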

~Andrew
Jan Beulich Jan. 28, 2020, 1:59 p.m. UTC | #5
On 27.01.2020 21:21, Andrew Cooper wrote:
> Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
> configured with the HYPERVISOR bit before native CPUID is scanned for feature
> bits.
> 
> This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
> ends up appearing in the raw and host CPU policies.  Nothing has really cared
> in the past.
> 
> Alter amd_init_levelling() to exclude the HYPERVISOR bit from
> cpuidmask_defaults, and update domain_cpu_policy_changed() to allow it to be
> explicitly forwarded.
> 
> This in turn highlighted that dom0 construction was asymmetric with domU
> construction, by not having any calls to domain_cpu_policy_changed().  Extend
> arch_domain_create() to always call domain_cpu_policy_changed().
> 
> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Igor Druzhinin <igor.druzhinin@citrix.com>
> 
> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
> I haven't investigated the issue with that patch specifically, because
> cpu_has_hypervisor being wrong is obviously a bug.
> 
> This is one of two possible approaches, and both have their downsides.  This
> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
> they will usually differ in HYPERVISOR bit.

Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
off the top of my head I can't recall us special casing idle wrt
CPUID handling), but why for PV vs HVM? The idle case, if there is
an issue with this, could be taken care of by actually setting the
bit there, as no-one should care about what it's set to?

> The other approach is to order things more carefully so levelling is
> configured after scanning for cpuid bits, but that has the downside that it is
> very easy to regress.
> 
> Thoughts on which is the least-bad approach to take?  Having written this
> patch, I'm now erring on the side of doing it the other way.

Besides the need for me to understand the aspect above, I'm afraid
to judge I'd need to have at least a sketch of what the alternative
would look like, in particular to figure how fragile it really is.

Jan
Andrew Cooper Jan. 28, 2020, 5:14 p.m. UTC | #6
On 28/01/2020 13:59, Jan Beulich wrote:
> On 27.01.2020 21:21, Andrew Cooper wrote:
>> Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
>> configured with the HYPERVISOR bit before native CPUID is scanned for feature
>> bits.
>>
>> This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
>> ends up appearing in the raw and host CPU policies.  Nothing has really cared
>> in the past.
>>
>> Alter amd_init_levelling() to exclude the HYPERVISOR bit from
>> cpuidmask_defaults, and update domain_cpu_policy_changed() to allow it to be
>> explicitly forwarded.
>>
>> This in turn highlighted that dom0 construction was asymmetric with domU
>> construction, by not having any calls to domain_cpu_policy_changed().  Extend
>> arch_domain_create() to always call domain_cpu_policy_changed().
>>
>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Wei Liu <wl@xen.org>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Igor Druzhinin <igor.druzhinin@citrix.com>
>>
>> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
>> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
>> I haven't investigated the issue with that patch specifically, because
>> cpu_has_hypervisor being wrong is obviously a bug.
>>
>> This is one of two possible approaches, and both have their downsides.  This
>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>> they will usually differ in HYPERVISOR bit.
> Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
> off the top of my head I can't recall us special casing idle wrt
> CPUID handling), but why for PV vs HVM? The idle case, if there is
> an issue with this, could be taken care of by actually setting the
> bit there, as no-one should care about what it's set to?

d->arch.pv.cpuidmasks is only allocated for PV domains (and starts by
dup()ing the default).

When context switching levelling MSRs, any non-PV guest uses
cpuidmask_defaults.  This captures idle and HVM vcpus.
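
i.e. the selection on context switch is roughly (a sketch of the logic
in amd_ctxt_switch_levelling()):

    const struct domain *nextd = next ? next->domain : NULL;
    const struct cpuidmasks *masks =
        (nextd && is_pv_domain(nextd) && nextd->arch.pv.cpuidmasks)
        ? nextd->arch.pv.cpuidmasks       /* PV: per-domain masks */
        : &cpuidmask_defaults;            /* idle/HVM: the defaults */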

This is necessary because, at least at the time it was introduced,
{pv,hvm}_cpuid() issued native CPUID instructions to then feed data back
into guest context.  It's probably less relevant now that guest_cpuid()
doesn't issue native instructions in the general case.

Either way, HVM gained the default like idle, to cause the lazy
switching logic to switch less often.

The problem we have after this patch is that default differs from PV in
the HYPERVISOR bit, which basically guarantees that we rewrite the leaf
1 levelling on each context switch.

However, having looked at the other feature bits which differ for PV,
VME and PSE36 being hidden means we're always switching leaf 1 anyway,
so this change for HYPERVISOR doesn't make the situation any worse.

>> The other approach is to order things more carefully so levelling is
>> configured after scanning for cpuid bits, but that has the downside that it is
>> very easy to regress.
>>
>> Thoughts on which is the least-bad approach to take?  Having written this
>> patch, I'm now erring on the side of doing it the other way.
> Besides the need for me to understand the aspect above, I'm afraid
> to judge I'd need to have at least a sketch of what the alternative
> would look like, in particular to figure how fragile it really is.

It would be a small bit of careful reordering in cpu/amd.c

The tipping factor is that, even if we arrange for idle context not to
have HYPERVISOR set (and therefore cpu_has_hypervisor ending up clear
when scanned), a regular CPUID instruction in PV context would see
HYPERVISOR as a property of virtualising things sensibly for guests.

As we need to cope with HYPERVISOR being visible in some contexts, it's
better to consider it uniformly visible and break any notional link
between cpu_has_hypervisor and what CPUID would report for the bit.

~Andrew
Jan Beulich Jan. 29, 2020, 8:17 a.m. UTC | #7
On 28.01.2020 18:14, Andrew Cooper wrote:
> On 28/01/2020 13:59, Jan Beulich wrote:
>> On 27.01.2020 21:21, Andrew Cooper wrote:
>>> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
>>> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
>>> I haven't investigated the issue with that patch specifically, because
>>> cpu_has_hypervisor being wrong is obviously a bug.
>>>
>>> This is one of two possible approaches, and both have their downsides.  This
>>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>>> they will usually differ in HYPERVISOR bit.
>> Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
>> off the top of my head I can't recall us special casing idle wrt
>> CPUID handling), but why for PV vs HVM? The idle case, if there is
>> an issue with this, could be taken care of by actually setting the
>> bit there, as no-one should care about what it's set to?
> 
> d->arch.pv.cpuidmasks is only allocated for PV domains (and starts by
> dup()ing the default).

Ah, that's the piece I was missing. My next question then is: Why
do we do *_init_levelling() from early_init_*() in the first place?
It looks conceptually wrong to me to set up leveling before having
obtained CPUID data. Wouldn't there better be a separate hook in
struct cpu_dev, to be invoked e.g. from identify_cpu() after
generic_identify()?
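
For concreteness, a minimal sketch of such a hook (the hook name is
invented here for illustration):

    struct cpu_dev {
        void (*c_early_init)(struct cpuinfo_x86 *c);
        void (*c_init)(struct cpuinfo_x86 *c);
        void (*c_init_levelling)(void);        /* hypothetical new hook */
    };

    void identify_cpu(struct cpuinfo_x86 *c)
    {
        generic_identify(c);                   /* scan native CPUID first */
        /* ... */
        if ( c == &boot_cpu_data && this_cpu->c_init_levelling )
            this_cpu->c_init_levelling();      /* levelling configured after */
    }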

> When context switching levelling MSRs, any non-PV guest uses
> cpuidmask_defaults.  This captures idle and HVM vcpus.
> 
> This is necessary because, at least at the time it was introduced,
> {pv,hvm}_cpuid() issued native CPUID instructions to then feed data back
> into guest context.  It's probably less relevant now that guest_cpuid()
> doesn't issue native instructions in the general case.
> 
> Either way, HVM gained the default like idle, to cause the lazy
> switching logic to switch less often.
> 
> The problem we have after this patch is that default differs from PV in
> the HYPERVISOR bit, which basically guarantees that we rewrite the leaf
> 1 levelling on each context switch.
> 
> However, having looked at the other feature bits which differ for PV,
> VME and PSE36 being hidden means we're always switching leaf 1 anyway,
> so this change for HYPERVISOR doesn't make the situation any worse.
> 
>>> The other approach is to order things more carefully so levelling is
>>> configured after scanning for cpuid bits, but that has the downside that it is
>>> very easy to regress.
>>>
>>> Thoughts on which is the least-bad approach to take?  Having written this
>>> patch, I'm now erring on the side of doing it the other way.
>> Besides the need for me to understand the aspect above, I'm afraid
>> to judge I'd need to have at least a sketch of what the alternative
>> would look like, in particular to figure how fragile it really is.
> 
> It would be a small bit of careful reordering in cpu/amd.c
> 
> The tipping factor is that, even if we arrange for idle context not to
> have HYPERVISOR set (and therefore cpu_has_hypervisor ending up clear
> when scanned), a regular CPUID instruction in PV context would see
> HYPERVISOR as a property of virtualising things sensibly for guests.
> 
> As we need to cope with HYPERVISOR being visible in some contexts, it's
> better to consider it uniformly visible and break any notional link
> between cpu_has_hypervisor and what CPUID would report for the bit.

After having set up leveling I think we shouldn't use CPUID anymore
for leaves which may be leveled. As a result cpu_has_* / cpu_has()
would then be the only information source in this regard.

Such a general re-arrangement would then also appear to address your
"easy to regress" concern.

Jan
Andrew Cooper Jan. 29, 2020, 1:08 p.m. UTC | #8
On 29/01/2020 08:17, Jan Beulich wrote:
> On 28.01.2020 18:14, Andrew Cooper wrote:
>> On 28/01/2020 13:59, Jan Beulich wrote:
>>> On 27.01.2020 21:21, Andrew Cooper wrote:
>>>> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
>>>> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
>>>> I haven't investigated the issue with that patch specifically, because
>>>> cpu_has_hypervisor being wrong is obviously a bug.
>>>>
>>>> This is one of two possible approaches, and both have their downsides.  This
>>>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>>>> they will usually differ in HYPERVISOR bit.
>>> Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
>>> off the top of my head I can't recall us special casing idle wrt
>>> CPUID handling), but why for PV vs HVM? The idle case, if there is
>>> an issue with this, could be taken care of by actually setting the
>>> bit there, as no-one should care about what it's set to?
>> d->arch.pv.cpuidmasks is only allocated for PV domains (and starts by
>> dup()ing the default).
> Ah, that's the piece I was missing. My next question then is: Why
> do we do *_init_levelling() from early_init_*() in the first place?
> It looks conceptually wrong to me to set up leveling before having
> obtained CPUID data. Wouldn't there better be a separate hook in
> struct cpu_dev, to be invoked e.g. from identify_cpu() after
> generic_identify()?

cpuid_mask_cpu= literally means "pretend you're this older CPU instead",
and was implemented in a way which affected Xen.  This is why levelling
is configured that early.

Now that this isn't the only way to make heterogeneous migration safe,
perhaps we don't care so much, but it would still be a behavioural
change to the cpuid_mask_* parameters.

>> When context switching levelling MSRs, any non-PV guest uses
>>> cpuidmask_defaults.  This captures idle and HVM vcpus.
>>
>> This is necessary because, at least at the time it was introduced,
>> {pv,hvm}_cpuid() issued native CPUID instructions to then feed data back
>>> into guest context.  It's probably less relevant now that guest_cpuid()
>> doesn't issue native instructions in the general case.
>>
>> Either way, HVM gained the default like idle, to cause the lazy
>> switching logic to switch less often.
>>
>> The problem we have after this patch is that default differs from PV in
>> the HYPERVISOR bit, which basically guarantees that we rewrite the leaf
>> 1 levelling on each context switch.
>>
>>> However, having looked at the other feature bits which differ for PV,
>> VME and PSE36 being hidden means we're always switching leaf 1 anyway,
>> so this change for HYPERVISOR doesn't make the situation any worse.
>>
>>>> The other approach is to order things more carefully so levelling is
>>>> configured after scanning for cpuid bits, but that has the downside that it is
>>>> very easy to regress.
>>>>
>>>> Thoughts on which is the least-bad approach to take?  Having written this
>>>> patch, I'm now erring on the side of doing it the other way.
>>> Besides the need for me to understand the aspect above, I'm afraid
>>> to judge I'd need to have at least a sketch of what the alternative
>>> would look like, in particular to figure how fragile it really is.
>> It would be a small bit of careful reordering in cpu/amd.c
>>
>> The tipping factor is that, even if we arrange for idle context not to
>> have HYPERVISOR set (and therefore cpu_has_hypervisor ending up clear
>> when scanned), a regular CPUID instruction in PV context would see
>> HYPERVISOR as a property of virtualising things sensibly for guests.
>>
>> As we need to cope with HYPERVISOR being visible in some contexts, it's
>> better to consider it uniformly visible and break any notional link
>> between cpu_has_hypervisor and what CPUID would report for the bit.
> After having set up leveling I think we shouldn't use CPUID anymore
> for leaves which may be leveled. As a result cpu_has_* / cpu_has()
> would then be the only information source in this regard.
>
> Such a general re-arrangement would then also appear to address your
> "easy to regress" concern.

I've just had another thought, and it rules out other approaches.

We use ctxt_switch_levelling(NULL) on the crash path to reset things for
the next kernel, and that needs to hide the HYPERVISOR bit on AMD.

Therefore, the approach in this patch is the only sensible action (and
I'm now not concerned about the performance hit, as it won't actually
increase the number of MSR writes we make).

I think I need to rewrite the commit message, but not the code.

~Andrew
Jan Beulich Jan. 29, 2020, 1:43 p.m. UTC | #9
On 29.01.2020 14:08, Andrew Cooper wrote:
> On 29/01/2020 08:17, Jan Beulich wrote:
>> On 28.01.2020 18:14, Andrew Cooper wrote:
>>> On 28/01/2020 13:59, Jan Beulich wrote:
>>>> On 27.01.2020 21:21, Andrew Cooper wrote:
>>>>> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
>>>>> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
>>>>> I haven't investgiated the issue with that patch specifically, because
>>>>> cpu_has_hypervisor being wrong is obviously a bug.
>>>>>
>>>>> This is one of two possible approaches, and both have their downsides.  This
>>>>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>>>>> they will usually differ in HYPERVISOR bit.
>>>> Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
>>>> off the top of my head I can't recall us special casing idle wrt
>>>> CPUID handling), but why for PV vs HVM? The idle case, if there is
>>>> an issue with this, could be taken care of by actually setting the
>>>> bit there, as no-one should care about what it's set to?
>>> d->arch.pv.cpuidmasks is only allocated for PV domains (and starts by
>>> dup()ing the default).
>> Ah, that's the piece I was missing. My next question then is: Why
>> do we do *_init_levelling() from early_init_*() in the first place?
>> It looks conceptually wrong to me to set up leveling before having
>> obtained CPUID data. Wouldn't there better be a separate hook in
>> struct cpu_dev, to be invoked e.g. from identify_cpu() after
>> generic_identify()?
> 
> cpuid_mask_cpu= literally means "pretend you're this older CPU instead",
> and was implemented in a way which affected Xen.  This is why levelling
> is configured that early.

It's indeed written like this in the cmdline doc, but I vaguely
recall questioning this behavior at least once before.

> Now that this isn't the only way to make heterogeneous migration safe,
> perhaps we don't care so much, but it would still be a behavioural
> change to the cpuid_mask_* parameters.

And indeed we have cpuid= now to control the features Xen is
to use. I don't fancy looking at bug reports where I first need
to sort out the interaction between these two cmdline options.

>>> When context switching levelling MSRs, any non-PV guest uses
>>> cpumask_default.  This captures idle and HVM vcpus.
>>>
>>> This is necessary because, at least at the time it was introduced,
>>> {pv,hvm}_cpuid() issued native CPUID instructions to then feed data back
>>> into guest context.  Its probably less relevant now that guest_cpuid()
>>> doesn't issue native instructions in the general case.
>>>
>>> Either way, HVM gained the default like idle, to cause the lazy
>>> switching logic to switch less often.
>>>
>>> The problem we have after this patch is that default differs from PV in
>>> the HYPERVISOR bit, which basically guarantees that we rewrite the leaf
>>> 1 levelling on each context switch.
>>>
>>> However, having looked at the other features bits which differ for PV,
>>> VME and PSE36 being hidden means we're always switching leaf 1 anyway,
>>> so this change for HYPERVISOR doesn't make the situation any worse.
>>>
>>>>> The other approach is to order things more carefully so levelling is
>>>>> configured after scanning for cpuid bits, but that has the downside that it is
>>>>> very easy to regress.
>>>>>
>>>>> Thoughts on which is the least-bad approach to take?  Having written this
>>>>> patch, I'm now erring on the side of doing it the other way.
>>>> Besides the need for me to understand the aspect above, I'm afraid
>>>> to judge I'd need to have at least a sketch of what the alternative
>>>> would look like, in particular to figure how fragile it really is.
>>> It would be a small bit of careful reordering in cpu/amd.c
>>>
>>> The tipping factor is that, even if we arrange for idle context not to
>>> have HYPERVISOR set (and therefore cpu_has_hypervisor ending up clear
>>> when scanned), a regular CPUID instruction in PV context would see
>>> HYPERVISOR as a property of virtualising things sensibly for guests.
>>>
>>> As we need to cope with HYPERVISOR being visible in some contexts, its
>>> better to consider it uniformly visible and break any kind of notional
>>> link between cpu_has_hypervisor matching what CPUID would see as the bit.
>> After having set up leveling I think we shouldn't use CPUID anymore
>> for leaves which may be leveled. As a result cpu_has_* / cpu_has()
>> would then be the only information source in this regard.
>>
>> Such a general re-arrangement would then also appear to address your
>> "easy to regress" concern.
> 
> I've just had another thought, and it rules out other approaches.
> 
> We use ctxt_switch_levelling(NULL) on the crash path to reset things for
> next kernel, and that needs to hide the HYPERVISOR bit on AMD.
> 
> Therefore, the approach in this patch is the only sensible action (and
> I'm now not concerned about the performance hit, as it won't actually
> increase the number of MSR writes we make).

I'm not fully convinced: Why is setting the MSRs back to
cpuidmask_defaults the correct thing to do in the first place?
Shouldn't we reset them to what we found on boot?

Jan

Patch

diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 8b5f0f2e4c..0906b23582 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -297,9 +297,6 @@  static void __init noinline amd_init_levelling(void)
 			ecx |= cpufeat_mask(X86_FEATURE_OSXSAVE);
 		edx |= cpufeat_mask(X86_FEATURE_APIC);
 
-		/* Allow the HYPERVISOR bit to be set via guest policy. */
-		ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
-
 		cpuidmask_defaults._1cd = ((uint64_t)ecx << 32) | edx;
 	}
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 28fefa1f81..316b801597 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -665,6 +665,8 @@  int arch_domain_create(struct domain *d,
      */
     d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8;
 
+    domain_cpu_policy_changed(d);
+
     return 0;
 
  fail:
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 5ed63ac10a..0627eb4e06 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -48,7 +48,7 @@  static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
 }
 #endif
 
-static void domain_cpu_policy_changed(struct domain *d)
+void domain_cpu_policy_changed(struct domain *d)
 {
     const struct cpuid_policy *p = d->arch.cpuid;
     struct vcpu *v;
@@ -106,6 +106,13 @@  static void domain_cpu_policy_changed(struct domain *d)
                     ecx = 0;
                 edx = cpufeat_mask(X86_FEATURE_APIC);
 
+                /*
+                 * If the Hypervisor bit is set in the policy, we can also
+                 * forward it into real CPUID.
+                 */
+                if ( p->basic.hypervisor )
+                    ecx |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
+
                 mask |= ((uint64_t)ecx << 32) | edx;
                 break;
             }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index a3ae5d9a20..817065bf81 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -624,6 +624,8 @@  struct guest_memory_policy
 void update_guest_memory_policy(struct vcpu *v,
                                 struct guest_memory_policy *policy);
 
+void domain_cpu_policy_changed(struct domain *d);
+
 bool update_runstate_area(struct vcpu *);
 bool update_secondary_system_time(struct vcpu *,
                                   struct vcpu_time_info *);