[v6,04/16] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

Message ID 20210511024214.280733-5-like.xu@linux.intel.com (mailing list archive)
State New, archived
Series KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS

Commit Message

Like Xu May 11, 2021, 2:42 a.m. UTC
On Intel platforms, software can use the IA32_MISC_ENABLE[7] bit to detect
whether the processor supports the performance monitoring facility.

For a guest, the bit is set when the PMU is enabled for that guest, and a
software write to this "available" bit is ignored.

Cc: Yao Yuan <yuan.yao@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 1 +
 arch/x86/kvm/x86.c           | 1 +
 2 files changed, 2 insertions(+)
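
The two hunks can be modeled outside the kernel. What follows is a minimal sketch, not KVM code: `struct toy_vcpu` and the `toy_*` helpers are hypothetical stand-ins, and the write helper keeps the stored bit 7 unchanged, which is one reading of "a write is ignored" rather than a copy of the posted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* IA32_MISC_ENABLE[7]; the patch calls it MSR_IA32_MISC_ENABLE_EMON. */
#define TOY_MISC_ENABLE_EMON (1ULL << 7)

/* Hypothetical stand-in for the vCPU fields the patch touches. */
struct toy_vcpu {
	unsigned int pmu_version;	/* 0: no vPMU exposed to the guest */
	uint64_t misc_enable;		/* guest-visible IA32_MISC_ENABLE */
};

/* Like the intel_pmu_refresh() hunk: set bit 7 only when a vPMU exists. */
static void toy_pmu_refresh(struct toy_vcpu *vcpu)
{
	if (!vcpu->pmu_version)
		return;
	vcpu->misc_enable |= TOY_MISC_ENABLE_EMON;
}

/* Guest write: bit 7 of the written value is dropped, the stored bit stays. */
static void toy_guest_write(struct toy_vcpu *vcpu, uint64_t data)
{
	data &= ~TOY_MISC_ENABLE_EMON;
	data |= vcpu->misc_enable & TOY_MISC_ENABLE_EMON;
	vcpu->misc_enable = data;
}

/* Small self-check of the guest-visible behavior described above. */
static void toy_demo(void)
{
	struct toy_vcpu vcpu = { .pmu_version = 2, .misc_enable = 0 };

	toy_pmu_refresh(&vcpu);
	assert(vcpu.misc_enable & TOY_MISC_ENABLE_EMON);

	toy_guest_write(&vcpu, 0);	/* guest tries to clear bit 7 */
	assert(vcpu.misc_enable & TOY_MISC_ENABLE_EMON);
}
```

Calling `toy_demo()` exercises the case of a guest with a vPMU; a vCPU with `pmu_version == 0` never gets the bit set in this model.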

Comments

Venkatesh Srinivas May 12, 2021, 1:58 a.m. UTC | #1
On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> detect whether the processor supports performance monitoring facility.
>
> It depends on the PMU is enabled for the guest, and a software write
> operation to this available bit will be ignored.

Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
documented someplace?

Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>

> Cc: Yao Yuan <yuan.yao@intel.com>
> Signed-off-by: Like Xu <like.xu@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/pmu_intel.c | 1 +
>  arch/x86/kvm/x86.c           | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 9efc1a6b8693..d9dbebe03cae 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  	if (!pmu->version)
>  		return;
>
> +	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>  	perf_get_x86_pmu_capability(&x86_pmu);
>
>  	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5bd550eaf683..abe3ea69078c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		}
>  		break;
>  	case MSR_IA32_MISC_ENABLE:
> +		data &= ~MSR_IA32_MISC_ENABLE_EMON;
>  		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>  		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>  			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
> --
> 2.31.1
>
>
Xu, Like May 12, 2021, 5 a.m. UTC | #2
Hi Venkatesh Srinivas,

On 2021/5/12 9:58, Venkatesh Srinivas wrote:
> On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>> detect whether the processor supports performance monitoring facility.
>>
>> It depends on the PMU is enabled for the guest, and a software write
>> operation to this available bit will be ignored.
> Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
> documented someplace?

The bit[7] behavior of the real hardware on the native host is quite
suspicious.

To keep the semantics consistent and simple, we propose ignoring write
operation in the virtualized world, since whether or not to expose PMU is
configured by the hypervisor user space and not by the guest side.

I assume your "reviewed-by" also points this out. Thanks.

>
> Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>
>
>> Cc: Yao Yuan <yuan.yao@intel.com>
>> Signed-off-by: Like Xu <like.xu@linux.intel.com>
>> ---
>>   arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>   arch/x86/kvm/x86.c           | 1 +
>>   2 files changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 9efc1a6b8693..d9dbebe03cae 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>   	if (!pmu->version)
>>   		return;
>>
>> +	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>   	perf_get_x86_pmu_capability(&x86_pmu);
>>
>>   	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 5bd550eaf683..abe3ea69078c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>   		}
>>   		break;
>>   	case MSR_IA32_MISC_ENABLE:
>> +		data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>   		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>   		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>   			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>> --
>> 2.31.1
>>
>>
Sean Christopherson May 12, 2021, 3:18 p.m. UTC | #3
On Wed, May 12, 2021, Xu, Like wrote:
> Hi Venkatesh Srinivas,
> 
> On 2021/5/12 9:58, Venkatesh Srinivas wrote:
> > On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
> > > On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> > > detect whether the processor supports performance monitoring facility.
> > > 
> > > It depends on the PMU is enabled for the guest, and a software write
> > > operation to this available bit will be ignored.
> > Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
> > documented someplace?
> 
> The bit[7] behavior of the real hardware on the native host is quite
> suspicious.

Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
current phrasing is a mess:

  Performance Monitoring Available (R)
  1 = Performance monitoring enabled.
  0 = Performance monitoring disabled.

The (R) is ambiguous because most other entries that are read-only use (RO), and
the "enabled vs. disabled" implies the bit is writable and really does control
the PMU.  But on my Haswell system, it's read-only.  Assuming the bit is supposed
to be a read-only "PMU supported bit", the SDM should be:

  Performance Monitoring Available (RO)
  1 = Performance monitoring supported.
  0 = Performance monitoring not supported.

And please update the changelog to explain the "why" of whatever the behavior
ends up being.  The "what" is obvious from the code.

> To keep the semantics consistent and simple, we propose ignoring write
> operation in the virtualized world, since whether or not to expose PMU is
> configured by the hypervisor user space and not by the guest side.

Making up our own architectural behavior because it's convenient is not a good
idea.

> > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > index 9efc1a6b8693..d9dbebe03cae 100644
> > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> > >   	if (!pmu->version)
> > >   		return;
> > > 
> > > +	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;

Hmm, normally I would say overwriting the guest's value is a bad idea, but if
the bit really is a read-only "PMU supported" bit, then this is the correct
behavior, albeit weird if userspace does a late CPUID update (though that's
weird no matter what).

> > >   	perf_get_x86_pmu_capability(&x86_pmu);
> > > 
> > >   	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5bd550eaf683..abe3ea69078c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >   		}
> > >   		break;
> > >   	case MSR_IA32_MISC_ENABLE:
> > > +		data &= ~MSR_IA32_MISC_ENABLE_EMON;

However, this is not.  If it's a read-only bit, then toggling the bit should
cause a #GP.

> > >   		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
> > >   		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
> > >   			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
> > > --
Xu, Like May 13, 2021, 2:50 a.m. UTC | #4
On 2021/5/12 23:18, Sean Christopherson wrote:
> On Wed, May 12, 2021, Xu, Like wrote:
>> Hi Venkatesh Srinivas,
>>
>> On 2021/5/12 9:58, Venkatesh Srinivas wrote:
>>> On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>>>> detect whether the processor supports performance monitoring facility.
>>>>
>>>> It depends on the PMU is enabled for the guest, and a software write
>>>> operation to this available bit will be ignored.
>>> Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
>>> documented someplace?
>> The bit[7] behavior of the real hardware on the native host is quite
>> suspicious.
> Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
> current phrasing is a mess:
>
>    Performance Monitoring Available (R)
>    1 = Performance monitoring enabled.
>    0 = Performance monitoring disabled.
>
> The (R) is ambiguous because most other entries that are read-only use (RO), and
> the "enabled vs. disabled" implies the bit is writable and really does control
> the PMU.  But on my Haswell system, it's read-only.

On your Haswell system, does it cause a #GP, or is it just silent, if you
change this bit?

> Assuming the bit is supposed
> to be a read-only "PMU supported bit", the SDM should be:
>
>    Performance Monitoring Available (RO)
>    1 = Performance monitoring supported.
>    0 = Performance monitoring not supported.
>
> And please update the changelog to explain the "why" of whatever the behavior
> ends up being.  The "what" is obvious from the code.

Thanks for your "why" comment.

>
>> To keep the semantics consistent and simple, we propose ignoring write
>> operation in the virtualized world, since whether or not to expose PMU is
>> configured by the hypervisor user space and not by the guest side.
> Making up our own architectural behavior because it's convenient is not a good
> idea.

Sometimes we do change it.

For example, the scope of some MSRs may be "core level shared",
but we likely keep it as a "thread level" variable in KVM out of
convenience.

>
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>    	if (!pmu->version)
>>>>    		return;
>>>>
>>>> +	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
> Hmm, normally I would say overwriting the guest's value is a bad idea, but if
> the bit really is a read-only "PMU supported" bit, then this is the correct
> behavior, albeit weird if userspace does a late CPUID update (though that's
> weird no matter what).
>
>>>>    	perf_get_x86_pmu_capability(&x86_pmu);
>>>>
>>>>    	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 5bd550eaf683..abe3ea69078c 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>>    		}
>>>>    		break;
>>>>    	case MSR_IA32_MISC_ENABLE:
>>>> +		data &= ~MSR_IA32_MISC_ENABLE_EMON;
> However, this is not.  If it's a read-only bit, then toggling the bit should
> cause a #GP.

The proposal here is to make it an unchangeable bit and not raise a #GP
if the guest changes it.

It may differ from the host behavior, but it doesn't cause any potential
issue if some guest code changes it during the use of performance
monitoring.

Does this make sense to you, or do you want to keep it strictly the same
as the host side?

>
>>>>    		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>    		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>    			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>> --
Venkatesh Srinivas May 17, 2021, 6:43 p.m. UTC | #5
On Wed, May 12, 2021 at 7:50 PM Xu, Like <like.xu@intel.com> wrote:
>
> On 2021/5/12 23:18, Sean Christopherson wrote:
> > On Wed, May 12, 2021, Xu, Like wrote:
> >> Hi Venkatesh Srinivas,
> >>
> >> On 2021/5/12 9:58, Venkatesh Srinivas wrote:
> >>> On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
> >>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> >>>> detect whether the processor supports performance monitoring facility.
> >>>>
> >>>> It depends on the PMU is enabled for the guest, and a software write
> >>>> operation to this available bit will be ignored.
> >>> Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
> >>> documented someplace?
> >> The bit[7] behavior of the real hardware on the native host is quite
> >> suspicious.
> > Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
> > current phrasing is a mess:
> >
> >    Performance Monitoring Available (R)
> >    1 = Performance monitoring enabled.
> >    0 = Performance monitoring disabled.
> >
> > The (R) is ambiguous because most other entries that are read-only use (RO), and
> > the "enabled vs. disabled" implies the bit is writable and really does control
> > the PMU.  But on my Haswell system, it's read-only.
>
> On your Haswell system, does it cause #GP or just silent if you change this
> bit ?
>
> > Assuming the bit is supposed
> > to be a read-only "PMU supported bit", the SDM should be:
> >
> >    Performance Monitoring Available (RO)
> >    1 = Performance monitoring supported.
> >    0 = Performance monitoring not supported.

Can't speak to Haswell, but on Apollo Lake/Goldmont, this bit is _not_ set
natively and we get a #GP when attempting to set it, even though the PMU is
available.

Should this bit be conditional on the host having it set?

> >
> > And please update the changelog to explain the "why" of whatever the behavior
> > ends up being.  The "what" is obvious from the code.
>
> Thanks for your "why" comment.
>
> >
> >> To keep the semantics consistent and simple, we propose ignoring write
> >> operation in the virtualized world, since whether or not to expose PMU is
> >> configured by the hypervisor user space and not by the guest side.
> > Making up our own architectural behavior because it's convenient is not a good
> > idea.
>
> Sometime we do change it.
>
> For example, the scope of some msrs may be "core level share"
> but we likely keep it as a "thread level" variable in the KVM out of
> convenience.
>
> >
> >>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> index 9efc1a6b8693..d9dbebe03cae 100644
> >>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> >>>>            if (!pmu->version)
> >>>>                    return;
> >>>>
> >>>> +  vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
> > Hmm, normally I would say overwriting the guest's value is a bad idea, but if
> > the bit really is a read-only "PMU supported" bit, then this is the correct
> > behavior, albeit weird if userspace does a late CPUID update (though that's
> > weird no matter what).
> >
> >>>>            perf_get_x86_pmu_capability(&x86_pmu);
> >>>>
> >>>>            pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index 5bd550eaf683..abe3ea69078c 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >>>>                    }
> >>>>                    break;
> >>>>            case MSR_IA32_MISC_ENABLE:
> >>>> +          data &= ~MSR_IA32_MISC_ENABLE_EMON;
> > However, this is not.  If it's a read-only bit, then toggling the bit should
> > cause a #GP.
>
> The proposal here is trying to make it as an
> unchangeable bit and don't make it #GP if guest changes it.
>
> It may different from the host behavior but
> it doesn't cause potential issue if some guest code
> changes it during the use of performance monitoring.
>
> Does this make sense to you or do you want to
> keep it strictly the same as the host side?
>
> >
> >>>>                    if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
> >>>>                        ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
> >>>>                            if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
> >>>> --
>
Sean Christopherson May 17, 2021, 9:16 p.m. UTC | #6
On Thu, May 13, 2021, Xu, Like wrote:
> On 2021/5/12 23:18, Sean Christopherson wrote:
> > On Wed, May 12, 2021, Xu, Like wrote:
> > > Hi Venkatesh Srinivas,
> > > 
> > > On 2021/5/12 9:58, Venkatesh Srinivas wrote:
> > > > On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
> > > > > On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> > > > > detect whether the processor supports performance monitoring facility.
> > > > > 
> > > > > It depends on the PMU is enabled for the guest, and a software write
> > > > > operation to this available bit will be ignored.
> > > > Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
> > > > documented someplace?
> > > The bit[7] behavior of the real hardware on the native host is quite
> > > suspicious.
> > Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
> > current phrasing is a mess:
> > 
> >    Performance Monitoring Available (R)
> >    1 = Performance monitoring enabled.
> >    0 = Performance monitoring disabled.
> > 
> > The (R) is ambiguous because most other entries that are read-only use (RO), and
> > the "enabled vs. disabled" implies the bit is writable and really does control
> > the PMU.  But on my Haswell system, it's read-only.
> 
> On your Haswell system, does it cause #GP or just silent if you change this
> bit ?

Attempting to clear the bit generates a #GP.

> > Assuming the bit is supposed
> > to be a read-only "PMU supported bit", the SDM should be:
> > 
> >    Performance Monitoring Available (RO)
> >    1 = Performance monitoring supported.
> >    0 = Performance monitoring not supported.
> > 
> > And please update the changelog to explain the "why" of whatever the behavior
> > ends up being.  The "what" is obvious from the code.
> 
> Thanks for your "why" comment.
> 
> > 
> > > To keep the semantics consistent and simple, we propose ignoring write
> > > operation in the virtualized world, since whether or not to expose PMU is
> > > configured by the hypervisor user space and not by the guest side.
> > > Making up our own architectural behavior because it's convenient is not a good
> > idea.
> 
> Sometime we do change it.
> 
> For example, the scope of some msrs may be "core level share"
> but we likely keep it as a "thread level" variable in the KVM out of
> convenience.

Thread vs. core scope is not architectural behavior.  Maybe you could argue that
it is for architectural MSRs, but even that is tenuous, e.g. SPEC_CTRL has this:

  The MSR bits are defined as logical processor scope. On some core
  implementations, the bits may impact sibling logical processors on the same core.

Regardless, the flaws of an inaccurate virtual CPU topology are well known, and
are a far cry from directly violating the SDM (assuming the SDM is fixed...).

> > > > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > > > index 9efc1a6b8693..d9dbebe03cae 100644
> > > > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > > > @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> > > > >    	if (!pmu->version)
> > > > >    		return;
> > > > > 
> > > > > +	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
> > Hmm, normally I would say overwriting the guest's value is a bad idea, but if
> > the bit really is a read-only "PMU supported" bit, then this is the correct
> > behavior, albeit weird if userspace does a late CPUID update (though that's
> > weird no matter what).
> > 
> > > > >    	perf_get_x86_pmu_capability(&x86_pmu);
> > > > > 
> > > > >    	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 5bd550eaf683..abe3ea69078c 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -3211,6 +3211,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > > > >    		}
> > > > >    		break;
> > > > >    	case MSR_IA32_MISC_ENABLE:
> > > > > +		data &= ~MSR_IA32_MISC_ENABLE_EMON;
> > However, this is not.  If it's a read-only bit, then toggling the bit should
> > cause a #GP.
> 
> The proposal here is trying to make it as an unchangeable bit and don't make
> it #GP if guest changes it.
> 
> It may different from the host behavior but it doesn't cause potential issue
> if some guest code changes it during the use of performance monitoring.
> 
> Does this make sense to you or do you want to keep it strictly the same as
> the host side?

Strictly the same as bare metal.  I don't see any reason to eat writes from the
guest.
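
For reference, the two behaviors being weighed in this subthread differ only in what happens when a write would change bit 7. The sketch below is illustrative and not taken from KVM: `toy_set_misc_enable()` and `enum toy_policy` are hypothetical names, and the return value follows the nonzero-means-fault convention of KVM's MSR write handlers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EMON (1ULL << 7)	/* IA32_MISC_ENABLE[7] */

enum toy_policy {
	TOY_FAULT_ON_TOGGLE,	/* strict read-only: changing the bit faults */
	TOY_IGNORE_WRITE,	/* the written bit is dropped, no fault */
};

/*
 * Returns 0 when the write is accepted and 1 when the caller should
 * inject a #GP into the guest.
 */
static int toy_set_misc_enable(uint64_t *cur, uint64_t data, enum toy_policy p)
{
	if (p == TOY_FAULT_ON_TOGGLE) {
		if ((*cur ^ data) & TOY_EMON)
			return 1;
	} else {
		/* Keep whatever bit 7 currently holds. */
		data = (data & ~TOY_EMON) | (*cur & TOY_EMON);
	}
	*cur = data;
	return 0;
}

/* Small self-check: the same write under each policy. */
static void toy_policy_demo(void)
{
	uint64_t msr = TOY_EMON;

	assert(toy_set_misc_enable(&msr, 0, TOY_FAULT_ON_TOGGLE) == 1);
	assert(toy_set_misc_enable(&msr, 0, TOY_IGNORE_WRITE) == 0);
	assert(msr == TOY_EMON);
}
```

Under either policy the guest can never change the bit; the only guest-visible difference is whether the attempt faults.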
Sean Christopherson May 17, 2021, 9:19 p.m. UTC | #7
On Mon, May 17, 2021, Venkatesh Srinivas wrote:
> Should this bit be conditional on the host having it set?

No need, KVM advertises the architectural PMU to userspace iff hardware itself
has an architectural PMU.  Userspace is free to lie to its guests so long as doing
so doesn't put KVM at risk.
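
For readers following along: the architectural PMU is enumerated by CPUID leaf 0AH, and a version ID of 0 in EAX[7:0] means there is none. As far as I understand, the `!pmu->version` check and the `eax.split.num_counters` field in the patch context come from that leaf. A minimal decoding sketch follows; the struct and function names are hypothetical, only the bit layout is from the SDM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the CPUID.0AH:EAX fields. */
struct toy_pmu_caps {
	uint8_t version_id;	/* EAX[7:0], 0 means no architectural PMU */
	uint8_t num_counters;	/* EAX[15:8], general-purpose counters */
	uint8_t bit_width;	/* EAX[23:16], counter width in bits */
	uint8_t mask_length;	/* EAX[31:24], length of the EBX bit vector */
};

/* Split a raw EAX value from CPUID leaf 0AH into its four byte fields. */
static struct toy_pmu_caps toy_decode_cpuid_0a_eax(uint32_t eax)
{
	struct toy_pmu_caps caps = {
		.version_id   = (uint8_t)(eax & 0xff),
		.num_counters = (uint8_t)((eax >> 8) & 0xff),
		.bit_width    = (uint8_t)((eax >> 16) & 0xff),
		.mask_length  = (uint8_t)((eax >> 24) & 0xff),
	};
	return caps;
}

/* Nonzero when the leaf reports an architectural PMU. */
static int toy_has_arch_pmu(uint32_t eax)
{
	return toy_decode_cpuid_0a_eax(eax).version_id != 0;
}

/* Small self-check with a made-up EAX value. */
static void toy_cpuid_demo(void)
{
	assert(toy_has_arch_pmu(0x07300404));
	assert(!toy_has_arch_pmu(0));
}
```

The EAX value in the self-check is invented for illustration; on real hardware it would come from executing CPUID with EAX = 0AH.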
Sean Christopherson May 17, 2021, 11:51 p.m. UTC | #8
On Mon, May 17, 2021, Sean Christopherson wrote:
> On Thu, May 13, 2021, Xu, Like wrote:
> > On 2021/5/12 23:18, Sean Christopherson wrote:
> > > On Wed, May 12, 2021, Xu, Like wrote:
> > > > Hi Venkatesh Srinivas,
> > > > 
> > > > On 2021/5/12 9:58, Venkatesh Srinivas wrote:
> > > > > On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
> > > > > > On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> > > > > > detect whether the processor supports performance monitoring facility.
> > > > > > 
> > > > > > It depends on the PMU is enabled for the guest, and a software write
> > > > > > operation to this available bit will be ignored.
> > > > > Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
> > > > > documented someplace?
> > > > The bit[7] behavior of the real hardware on the native host is quite
> > > > suspicious.
> > > Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
> > > current phrasing is a mess:
> > > 
> > >    Performance Monitoring Available (R)
> > >    1 = Performance monitoring enabled.
> > >    0 = Performance monitoring disabled.
> > > 
> > > The (R) is ambiguous because most other entries that are read-only use (RO), and
> > > the "enabled vs. disabled" implies the bit is writable and really does control
> > > the PMU.  But on my Haswell system, it's read-only.
> > 
> > On your Haswell system, does it cause #GP or just silent if you change this
> > bit ?
> 
> Attempting to clear the bit generates a #GP.

*sigh*

Venkatesh and I are exhausting our brown paper bag supply.

Attempting to clear bit 7 is ignored on both Haswell and Goldmont.  There is _no_ #GP,
the toggle is simply ignored.  I forgot to specify hex format (multiple times),
and Venkatesh accessed the wrong MSR (0x10a instead of 0x1a0).

So your proposal to ignore the toggle in KVM is the way to go, but please
document in the changelog that that behavior matches bare metal.

It would be nice to get the SDM cleaned up to use "supported/unsupported", and to
pick one of (R), (RO), and (R/O) for all MSR entries for consistency, but that
may be a pipe dream.

Sorry for the run-around :-/
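
For anyone wanting to repeat the bare-metal check, here is a minimal sketch that avoids both pitfalls mentioned above by hard-coding the index 0x1a0 and always parsing indices as hex. It assumes the Linux `msr` driver is loaded and the caller has permission to open `/dev/cpu/N/msr`; the `toy_*` names are hypothetical, and this is not the tooling that was actually used in this thread.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TOY_MSR_IA32_MISC_ENABLE 0x1a0u		/* not 0x10a */
#define TOY_MISC_ENABLE_EMON	 (1ULL << 7)

/* Parse an MSR index as hex, with or without a leading "0x". */
static uint32_t toy_parse_msr_index(const char *s)
{
	return (uint32_t)strtoul(s, NULL, 16);
}

/* Read one MSR via the msr driver; returns 0 on success, -1 on failure. */
static int toy_read_msr(int cpu, uint32_t index, uint64_t *val)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* The file offset selects the MSR; each read returns 8 bytes. */
	n = pread(fd, val, sizeof(*val), index);
	close(fd);
	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/* Nonzero when bit 7 is set in a raw IA32_MISC_ENABLE value. */
static int toy_emon_is_set(uint64_t misc_enable)
{
	return (misc_enable & TOY_MISC_ENABLE_EMON) != 0;
}

/* Small self-check of the parsing pitfall: decimal parsing stops at 'a'. */
static void toy_msr_demo(void)
{
	assert(toy_parse_msr_index("1a0") == TOY_MSR_IA32_MISC_ENABLE);
	assert(strtoul("1a0", NULL, 10) == 1);
}
```

A caller would run `toy_read_msr(0, TOY_MSR_IA32_MISC_ENABLE, &val)` before and after a write attempt and compare `toy_emon_is_set(val)`; the write side is left out here on purpose.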
Xu, Like May 18, 2021, 7:49 a.m. UTC | #9
On 2021/5/18 7:51, Sean Christopherson wrote:
> On Mon, May 17, 2021, Sean Christopherson wrote:
>> On Thu, May 13, 2021, Xu, Like wrote:
>>> On 2021/5/12 23:18, Sean Christopherson wrote:
>>>> On Wed, May 12, 2021, Xu, Like wrote:
>>>>> Hi Venkatesh Srinivas,
>>>>>
>>>>> On 2021/5/12 9:58, Venkatesh Srinivas wrote:
>>>>>> On 5/10/21, Like Xu <like.xu@linux.intel.com> wrote:
>>>>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>>>>>>> detect whether the processor supports performance monitoring facility.
>>>>>>>
>>>>>>> It depends on the PMU is enabled for the guest, and a software write
>>>>>>> operation to this available bit will be ignored.
>>>>>> Is the behavior that writes to IA32_MISC_ENABLE[7] are ignored (rather than #GP)
>>>>>> documented someplace?
>>>>> The bit[7] behavior of the real hardware on the native host is quite
>>>>> suspicious.
>>>> Ugh.  Can you file an SDM bug to get the wording and accessibility updated?  The
>>>> current phrasing is a mess:
>>>>
>>>>     Performance Monitoring Available (R)
>>>>     1 = Performance monitoring enabled.
>>>>     0 = Performance monitoring disabled.
>>>>
>>>> The (R) is ambiguous because most other entries that are read-only use (RO), and
>>>> the "enabled vs. disabled" implies the bit is writable and really does control
>>>> the PMU.  But on my Haswell system, it's read-only.
>>> On your Haswell system, does it cause #GP or just silent if you change this
>>> bit ?
>> Attempting to clear the bit generates a #GP.
> *sigh*
>
> Venkatesh and I are exhausting our brown paper bag supply.
>
> Attempting to clear bit 7 is ignored on both Haswell and Goldmont.  There is _no_ #GP,
> the toggle is simply ignored.  I forgot to specify hex format (multiple times),
> and Venkatesh accessed the wrong MSR (0x10a instead of 0x1a0).

*sigh*

>
> So your proposal to ignore the toggle in KVM is the way to go, but please
> document in the changelog that that behavior matches bare metal.

Thank you, I will clearly state it in the commit message.

>
> It would be nice to get the SDM cleaned up to use "supported/unsupported", and to
> pick one of (R), (RO), and (R/O) for all MSR entries for consistency, but that
> may be a pipe dream.

Glad you could review my code. I have reported this issue internally.

>
> Sorry for the run-around :-/
Patch

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9efc1a6b8693..d9dbebe03cae 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -488,6 +488,7 @@  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	if (!pmu->version)
 		return;
 
+	vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
 	perf_get_x86_pmu_capability(&x86_pmu);
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5bd550eaf683..abe3ea69078c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3211,6 +3211,7 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		}
 		break;
 	case MSR_IA32_MISC_ENABLE:
+		data &= ~MSR_IA32_MISC_ENABLE_EMON;
 		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
 		    ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
 			if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))