diff mbox series

[1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

Message ID 20240124003858.3954822-2-mizhang@google.com (mailing list archive)
State New, archived
Headers show
Series minor fix on perf_capabilities in KVM/x86 | expand

Commit Message

Mingwei Zhang Jan. 24, 2024, 12:38 a.m. UTC
Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
Without this, there is an issue in live migration. In particular, to
migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
the target is unable to set the value. This creates confusions on the user
side.

Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
wrong on the kvm_get_msr_common() which just reads
vcpu->arch.perf_capabilities.

Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
early in VM setup time.

Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/cpuid.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Sean Christopherson Jan. 24, 2024, 3:49 p.m. UTC | #1
On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> Without this, there is an issue in live migration. In particular, to
> migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> the target is unable to set the value. This creates confusions on the user
> side.
> 
> Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> wrong on the kvm_get_msr_common() which just reads
> vcpu->arch.perf_capabilities.
> 
> Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> early in VM setup time.
> 
> Cc: Aaron Lewis <aaronlewis@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/cpuid.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..416bee03c42a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
>  	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>  
> +	/* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> +	if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> +		vcpu->arch.perf_capabilities = 0;

No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
actually does something like that, but KVM overwriting guest state is almost
never a good thing.

I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
already fudging around this in our internal kernels, so I don't think there's a
need to carry a hack-a-fix for the destination kernel.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27e23714e960..fdef9d706d61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
        kvm_async_pf_hash_reset(vcpu);
 
-       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
        kvm_pmu_init(vcpu);
 
        vcpu->arch.pending_external_vector = -1;

>  	kvm_pmu_refresh(vcpu);
>  	vcpu->arch.cr4_guest_rsvd_bits =
>  	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
> -- 
> 2.43.0.429.g432eaa2c6b-goog
>
Aaron Lewis Jan. 24, 2024, 9:04 p.m. UTC | #2
On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> > Without this, there is an issue in live migration. In particular, to
> > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> > the target is unable to set the value. This creates confusions on the user
> > side.
> >
> > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> > wrong on the kvm_get_msr_common() which just reads
> > vcpu->arch.perf_capabilities.
> >
> > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> > early in VM setup time.
> >
> > Cc: Aaron Lewis <aaronlewis@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/cpuid.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index adba49afb5fe..416bee03c42a 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >       vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> >       vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> >
> > +     /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> > +     if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > +             vcpu->arch.perf_capabilities = 0;
>
> No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> actually does something like that, but KVM overwriting guest state is almost
> never a good thing.
>
> I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> already fudging around this in our internal kernels, so I don't think there's a
> need to carry a hack-a-fix for the destination kernel.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 27e23714e960..fdef9d706d61 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         kvm_async_pf_hash_reset(vcpu);
>
> -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;

Yeah, that will fix the issue we are seeing.  The only thing that's
not clear to me is if userspace should expect KVM to set this or if
KVM should expect userspace to set this.  How is that generally
decided?

>         kvm_pmu_init(vcpu);
>
>         vcpu->arch.pending_external_vector = -1;
>
> >       kvm_pmu_refresh(vcpu);
> >       vcpu->arch.cr4_guest_rsvd_bits =
> >           __cr4_reserved_bits(guest_cpuid_has, vcpu);
> > --
> > 2.43.0.429.g432eaa2c6b-goog
> >
Sean Christopherson Jan. 24, 2024, 9:25 p.m. UTC | #3
On Wed, Jan 24, 2024, Aaron Lewis wrote:
> On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> > > Without this, there is an issue in live migration. In particular, to
> > > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> > > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> > > the target is unable to set the value. This creates confusions on the user
> > > side.
> > >
> > > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> > > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> > > wrong on the kvm_get_msr_common() which just reads
> > > vcpu->arch.perf_capabilities.
> > >
> > > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> > > early in VM setup time.
> > >
> > > Cc: Aaron Lewis <aaronlewis@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/cpuid.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > index adba49afb5fe..416bee03c42a 100644
> > > --- a/arch/x86/kvm/cpuid.c
> > > +++ b/arch/x86/kvm/cpuid.c
> > > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >       vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> > >       vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> > >
> > > +     /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> > > +     if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > > +             vcpu->arch.perf_capabilities = 0;
> >
> > No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> > vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> > actually does something like that, but KVM overwriting guest state is almost
> > never a good thing.
> >
> > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> > already fudging around this in our internal kernels, so I don't think there's a
> > need to carry a hack-a-fix for the destination kernel.
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 27e23714e960..fdef9d706d61 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> >         kvm_async_pf_hash_reset(vcpu);
> >
> > -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> 
> Yeah, that will fix the issue we are seeing.  The only thing that's
> not clear to me is if userspace should expect KVM to set this or if
> KVM should expect userspace to set this.  How is that generally
> decided?

By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
To be consistent with KVM's CPUID module at vCPU creation, which is completely
empty (vCPU has no PMU and no PDCM support) KVM *must* zero
vcpu->arch.perf_capabilities.

If userspace wants a non-zero value, then userspace needs to set CPUID to enable
PDCM and set MSR_IA32_PERF_CAPABILITIES.

MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value.  That too
should be excised.

In a perfect world, KVM would also zero-initialize vcpu->arch.msr_platform_info,
but that one is less obviously broken and also less obviously safe to remove.

  commit e53d88af63ab4104e1226b8f9959f1e9903da10b
  Author:     Jim Mattson <jmattson@google.com>
  AuthorDate: Tue Oct 30 12:20:21 2018 -0700
  Commit:     Paolo Bonzini <pbonzini@redhat.com>
  CommitDate: Fri Dec 14 18:00:01 2018 +0100

      kvm: x86: Don't modify MSR_PLATFORM_INFO on vCPU reset
    
      If userspace has provided a different value for this MSR (e.g with the
      turbo bits set), the userspace-provided value should survive a vCPU
      reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized
      in kvm_arch_vcpu_setup.
    
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Drew Schmitt <dasch@google.com>
      Cc: Abhiroop Dabral <adabral@paloaltonetworks.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

In other words, KVM shouldn't define the vCPU model beyond the absolute bare
minimum that is required by the x86 architecture (as of P6 CPUs, which is more
or less the oldest CPU KVM can reasonably virtualize without carrying useless code).
Mingwei Zhang Jan. 24, 2024, 10:24 p.m. UTC | #4
On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> > > > Without this, there is an issue in live migration. In particular, to
> > > > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> > > > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> > > > the target is unable to set the value. This creates confusions on the user
> > > > side.
> > > >
> > > > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> > > > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> > > > wrong on the kvm_get_msr_common() which just reads
> > > > vcpu->arch.perf_capabilities.
> > > >
> > > > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> > > > early in VM setup time.
> > > >
> > > > Cc: Aaron Lewis <aaronlewis@google.com>
> > > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > > ---
> > > >  arch/x86/kvm/cpuid.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > index adba49afb5fe..416bee03c42a 100644
> > > > --- a/arch/x86/kvm/cpuid.c
> > > > +++ b/arch/x86/kvm/cpuid.c
> > > > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > >       vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> > > >       vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> > > >
> > > > +     /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> > > > +     if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > > > +             vcpu->arch.perf_capabilities = 0;
> > >
> > > No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> > > vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> > > actually does something like that, but KVM overwriting guest state is almost
> > > never a good thing.
> > >
> > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> > > already fudging around this in our internal kernels, so I don't think there's a
> > > need to carry a hack-a-fix for the destination kernel.
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 27e23714e960..fdef9d706d61 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > >
> > >         kvm_async_pf_hash_reset(vcpu);
> > >
> > > -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > 
> > Yeah, that will fix the issue we are seeing.  The only thing that's
> > not clear to me is if userspace should expect KVM to set this or if
> > KVM should expect userspace to set this.  How is that generally
> > decided?
> 
> By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> To be consistent with KVM's CPUID module at vCPU creation, which is completely
> empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> vcpu->arch.perf_capabilities.
> 
> If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> PDCM and set MSR_IA32_PERF_CAPABILITIES.
> 
> MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value.  That too
> should be excised.
> 
hmm, does that mean KVM just allows an invalid vcpu state exist from
host point of view? I think this makes a lot of confusions on migration
where VMM on the source believes that a non-zero value from KVM_GET_MSRS
is valid and the VMM on the target will find it not true.

If we follow the suggestion by removing the initial value at vCPU
creation time, then I think it breaks the existing VMM code, since that
requires VMM to explicitly set the MSR, which I am not sure we do today.

The following code below is different. The key difference is that the
following code preserves a valid value, but this case is to not preserve
an invalid value. 

Thanks.
-Mingwei

> In a perfect world, KVM would also zero-initialize vcpu->arch.msr_platform_info,
> but that one is less obviously broken and also less obviously safe to remove.
> 
>   commit e53d88af63ab4104e1226b8f9959f1e9903da10b
>   Author:     Jim Mattson <jmattson@google.com>
>   AuthorDate: Tue Oct 30 12:20:21 2018 -0700
>   Commit:     Paolo Bonzini <pbonzini@redhat.com>
>   CommitDate: Fri Dec 14 18:00:01 2018 +0100
> 
>       kvm: x86: Don't modify MSR_PLATFORM_INFO on vCPU reset
>     
>       If userspace has provided a different value for this MSR (e.g with the
>       turbo bits set), the userspace-provided value should survive a vCPU
>       reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized
>       in kvm_arch_vcpu_setup.
>     
>       Signed-off-by: Jim Mattson <jmattson@google.com>
>       Reviewed-by: Drew Schmitt <dasch@google.com>
>       Cc: Abhiroop Dabral <adabral@paloaltonetworks.com>
>       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> In other words, KVM shouldn't define the vCPU model beyond the absolute bare
> minimum that is required by the x86 architecture (as of P6 CPUs, which is more
> or less the oldest CPU KVM can reasonably virtualize without carrying useless code).
Sean Christopherson Jan. 24, 2024, 10:51 p.m. UTC | #5
On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> > > > vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> > > > actually does something like that, but KVM overwriting guest state is almost
> > > > never a good thing.
> > > >
> > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > need to carry a hack-a-fix for the destination kernel.
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 27e23714e960..fdef9d706d61 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > >
> > > >         kvm_async_pf_hash_reset(vcpu);
> > > >
> > > > -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > > 
> > > Yeah, that will fix the issue we are seeing.  The only thing that's
> > > not clear to me is if userspace should expect KVM to set this or if
> > > KVM should expect userspace to set this.  How is that generally
> > > decided?
> > 
> > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > vcpu->arch.perf_capabilities.
> > 
> > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> > 
> > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value.  That too
> > should be excised.
> > 
> hmm, does that mean KVM just allows an invalid vcpu state exist from
> host point of view?

Yes.

https://lore.kernel.org/all/ZC4qF90l77m3X1Ir@google.com

> I think this makes a lot of confusions on migration where VMM on the source
> believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> target will find it not true.

Yes, but seeing a non-zero value is a KVM bug that should be fixed.

> If we follow the suggestion by removing the initial value at vCPU
> creation time, then I think it breaks the existing VMM code, since that
> requires VMM to explicitly set the MSR, which I am not sure we do today.

Yeah, I'm hoping we can squeak by without breaking existing setups.

I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").

Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
KVM to define the vCPU is irresponsible, to put it nicely.

> The following code below is different. The key difference is that the
> following code preserves a valid value, but this case is to not preserve
> an invalid value. 

But it's a completely different fix.  I referenced that commit to call out that
the need for the commit and changelog suggests that someone (*cough* us) is relying
on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
That doesn't mean it's the correct KVM behavior, just that it's much riskier to
change.
Mingwei Zhang Jan. 25, 2024, 12:14 a.m. UTC | #6
On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> > > > > vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> > > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > > host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> > > > > actually does something like that, but KVM overwriting guest state is almost
> > > > > never a good thing.
> > > > >
> > > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> > > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > > need to carry a hack-a-fix for the destination kernel.
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 27e23714e960..fdef9d706d61 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > >
> > > > >         kvm_async_pf_hash_reset(vcpu);
> > > > >
> > > > > -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > > > 
> > > > Yeah, that will fix the issue we are seeing.  The only thing that's
> > > > not clear to me is if userspace should expect KVM to set this or if
> > > > KVM should expect userspace to set this.  How is that generally
> > > > decided?
> > > 
> > > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > > vcpu->arch.perf_capabilities.
> > > 
> > > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> > > 
> > > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value.  That too
> > > should be excised.
> > > 
> > hmm, does that mean KVM just allows an invalid vcpu state exist from
> > host point of view?
> 
> Yes.
> 
> https://lore.kernel.org/all/ZC4qF90l77m3X1Ir@google.com
> 
> > I think this makes a lot of confusions on migration where VMM on the source
> > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > target will find it not true.
> 
> Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> 
How about adding an entry in vmx_get_msr() for
MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.

Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
the initial permitted value.

The benefit is that it is not enforcing the VMM to explicitly set the
value. In fact, there are several platform MSRs which has initial value
that VMM may rely on instead of explicitly setting.
MSR_IA32_PERF_CAPABILITIES is only one of them.


> > If we follow the suggestion by removing the initial value at vCPU
> > creation time, then I think it breaks the existing VMM code, since that
> > requires VMM to explicitly set the MSR, which I am not sure we do today.
> 
> Yeah, I'm hoping we can squeak by without breaking existing setups.
> 
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> 
> Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.
> 
> > The following code below is different. The key difference is that the
> > following code preserves a valid value, but this case is to not preserve
> > an invalid value. 
> 
> But it's a completely different fix.  I referenced that commit to call out that
> the need for the commit and changelog suggests that someone (*cough* us) is relying
> on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
> That doesn't mean it's the correct KVM behavior, just that it's much riskier to
> change.
Sean Christopherson Jan. 26, 2024, 6:33 p.m. UTC | #7
On Thu, Jan 25, 2024, Mingwei Zhang wrote:
> On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > I think this makes a lot of confusions on migration where VMM on the source
> > > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > > target will find it not true.
> > 
> > Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> > 
> How about adding an entry in vmx_get_msr() for
> MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
> with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.
> Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
> the initial permitted value.

Hrm, I don't hate it as a stopgap.  But if we are the only people that are affected,
because again I'm pretty sure QEMU is fine, I would rather we just fix things in
our VMM and/or internal kernel.

Long term, I want some form of fix for the initialization code, even if that means
adding a quirk to let userspace opt out of KVM setting default values for platform
MSRs.

Side topic, vmx_set_msr() should check X86_FEATURE_PDCM, not just the PMU version.

> The benefit is that it is not enforcing the VMM to explicitly set the
> value. In fact, there are several platform MSRs which has initial value
> that VMM may rely on instead of explicitly setting.
> MSR_IA32_PERF_CAPABILITIES is only one of them.

Yeah, and all of those are broken.  AFAICT, the bad behavior got introduced for
MSR_PLATFORM_INFO, and then people kept copy+pasting that broken pattern :-(
Mingwei Zhang Jan. 26, 2024, 7:30 p.m. UTC | #8
+Frederick Mayle +Steven Moreland

On Fri, Jan 26, 2024 at 10:33 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 25, 2024, Mingwei Zhang wrote:
> > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > I think this makes a lot of confusions on migration where VMM on the source
> > > > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > > > target will find it not true.
> > >
> > > Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> > >
> > How about adding an entry in vmx_get_msr() for
> > MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
> > with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.
> > Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
> > the initial permitted value.
>
> Hrm, I don't hate it as a stopgap.  But if we are the only people that are affected,
> because again I'm pretty sure QEMU is fine, I would rather we just fix things in
> our VMM and/or internal kernel.

It is not just QEMU. crossvm is another open source VMM that suffers
from this one.

>
> Long term, I want some form of fix for the initialization code, even if that means
> adding a quirk to let userspace opt out of KVM setting default values for platform
> MSRs.

Yeah, that I definitely agree. It applies to all future platform MSRs.
Potentially, any patch trying to add a new platform MSR and initialize
it at vcpu create time should be rejected. And VMM should be told to
get the feature value and set it appropriately.

For existing platform MSRs, I think we should prioritize the "fix"
solution. Refactoring existing code that requires ABI changes may
raise complaints from production folks.

>
> Side topic, vmx_set_msr() should check X86_FEATURE_PDCM, not just the PMU version.

You are right.
>
> > The benefit is that it is not enforcing the VMM to explicitly set the
> > value. In fact, there are several platform MSRs which has initial value
> > that VMM may rely on instead of explicitly setting.
> > MSR_IA32_PERF_CAPABILITIES is only one of them.
>
> Yeah, and all of those are broken.  AFAICT, the bad behavior got introduced for
> MSR_PLATFORM_INFO, and then people kept copy+pasting that broken pattern :-(
Sean Christopherson Jan. 26, 2024, 7:34 p.m. UTC | #9
On Fri, Jan 26, 2024, Mingwei Zhang wrote:
> +Frederick Mayle +Steven Moreland
> 
> On Fri, Jan 26, 2024 at 10:33 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 25, 2024, Mingwei Zhang wrote:
> > > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > I think this makes a lot of confusions on migration where VMM on the source
> > > > > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > > > > target will find it not true.
> > > >
> > > > Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> > > >
> > > How about adding an entry in vmx_get_msr() for
> > > MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
> > > with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.
> > > Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
> > > the initial permitted value.
> >
> > Hrm, I don't hate it as a stopgap.  But if we are the only people that are affected,
> > because again I'm pretty sure QEMU is fine, I would rather we just fix things in
> > our VMM and/or internal kernel.
> 
> It is not just QEMU. crossvm is another open source VMM that suffers
> from this one.

Does CrosVM support migration or some other form of save/restore (RR?)?  And if
so, does CrosVM do that in conjunction with hiding the vPMU from the guest?

Because if not, then I think we can squeak by.
Paolo Bonzini Jan. 29, 2024, 2:39 p.m. UTC | #10
On 1/24/24 23:51, Sean Christopherson wrote:
>> If we follow the suggestion by removing the initial value at vCPU
>> creation time, then I think it breaks the existing VMM code, since that
>> requires VMM to explicitly set the MSR, which I am not sure we do today.
> Yeah, I'm hoping we can squeak by without breaking existing setups.
> 
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> 
> Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.

Yes, I tend to agree.

What QEMU does goes from the squeaky clean to the very debatable 
depending on the parameters you give it.

With "-cpu Haswell" and similar, it will provide values for all CPUID 
and MSR bits that match as much as possible values from an actual CPU 
model.  It will complain if there are some values that do not match[1].

With "-cpu host", it will copy values from KVM_GET_SUPPORTED_CPUID and 
from the feature MSRs, but only for features that it knows about.

With "-cpu host,migratable=no", it will copy values from 
KVM_GET_SUPPORTED_CPUID and from the feature MSRs, but only for *feature 
words* (CPUID registers, or MSRs) that it knows about.  This is where it 
becomes debatable, because a CPUID bit could be added without QEMU 
knowing the corresponding MSR.  In this case, the user probably expects 
the MSR to have a nonzero.  On one hand I agree that it would be 
irresponsible, on the other hand that's the point of "-cpu 
host,migratable=no".

If you want to proceed with the change, I don't have any problem with 
considering it a QEMU bug that it doesn't copy over to the guest any 
unknown leaves or MSRs.

Paolo

[1] Unfortunately it's not fatal because there are way way too many 
models, and also because until recently TCG lacked AVX---and therefore 
could only emulate completely some very old CPU models.  But with "-cpu 
Haswell,enforce" then everything's clean.
Paolo Bonzini Jan. 29, 2024, 2:40 p.m. UTC | #11
On 1/26/24 20:30, Mingwei Zhang wrote:
>> Hrm, I don't hate it as a stopgap.  But if we are the only people that are affected,
>> because again I'm pretty sure QEMU is fine, I would rather we just fix things in
>> our VMM and/or internal kernel.
> It is not just QEMU. crossvm is another open source VMM that suffers
> from this one.

Can you explain the symptoms in both Google's internal VMM and crosvm?

Thanks,

Paolo
Mingwei Zhang Jan. 31, 2024, 7:43 p.m. UTC | #12
On Mon, Jan 29, 2024, Paolo Bonzini wrote:
> On 1/24/24 23:51, Sean Christopherson wrote:
> > > If we follow the suggestion by removing the initial value at vCPU
> > > creation time, then I think it breaks the existing VMM code, since that
> > > requires VMM to explicitly set the MSR, which I am not sure we do today.
> > Yeah, I'm hoping we can squeak by without breaking existing setups.
> > 
> > I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> > since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> > ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> > 
> > Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
> > KVM to define the vCPU is irresponsible, to put it nicely.
> 
> Yes, I tend to agree.

Discussed with Sean offline. Yes, I also agree that this should be
handled at VMM level. MSR_IA32_PERF_CAPABILITIES should be regarded as
part of the CPUID, or sort of. The diff is that its own
"KVM_GET_SUPPORTED_CPUID" (ie., the default value) should come from
KVM_GET_MSRS of the device ioctl.

Providing the default value for MSR_IA32_PERF_CAPABILITIES is really
making things messed. KVM has to always guard access to the cached guest
value with the checking of X86_FEATURE_PDCM. I believe
guest_cpuid_has(vcpu, X86_FEATURE_PDCM) will take runtime cost.

> 
> What QEMU does goes from the squeaky clean to the very debatable depending
> on the parameters you give it.
> 
> With "-cpu Haswell" and similar, it will provide values for all CPUID and
> MSR bits that match as much as possible values from an actual CPU model.  It
> will complain if there are some values that do not match[1].
> 
> With "-cpu host", it will copy values from KVM_GET_SUPPORTED_CPUID and from
> the feature MSRs, but only for features that it knows about.
> 
> With "-cpu host,migratable=no", it will copy values from
> KVM_GET_SUPPORTED_CPUID and from the feature MSRs, but only for *feature
> words* (CPUID registers, or MSRs) that it knows about.  This is where it
> becomes debatable, because a CPUID bit could be added without QEMU knowing
> the corresponding MSR.  In this case, the user probably expects the MSR to
> have a nonzero.  On one hand I agree that it would be irresponsible, on the
> other hand that's the point of "-cpu host,migratable=no".
> 
> If you want to proceed with the change, I don't have any problem with
> considering it a QEMU bug that it doesn't copy over to the guest any unknown
> leaves or MSRs.
> 
reply from another thread: CrosVM issue is not related to this one. It
might have something to do with KVM_GET_MSR_INDEX_LIST. I will come up
details later.
> Paolo
> 
> [1] Unfortunately it's not fatal because there are way way too many models,
> and also because until recently TCG lacked AVX---and therefore could only
> emulate completely some very old CPU models.  But with "-cpu
> Haswell,enforce" then everything's clean.
>
diff mbox series

Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..416bee03c42a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -369,6 +369,9 @@  static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
 	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
 
+	/* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
+	if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+		vcpu->arch.perf_capabilities = 0;
 	kvm_pmu_refresh(vcpu);
 	vcpu->arch.cr4_guest_rsvd_bits =
 	    __cr4_reserved_bits(guest_cpuid_has, vcpu);