
[07/54] KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken

Message ID 20210622175739.3610207-8-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Bug fixes and summer cleaning

Commit Message

Sean Christopherson June 22, 2021, 5:56 p.m. UTC
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.

KVM does not correctly handle changes to paging-related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc.  KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held).  And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.

To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).

Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4KiB of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. Intel behavior (1 bit), and the
SEV C-bit would all need to be tracked.

In practice, there is no remotely sane use case for changing any
paging-related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst  | 11 ++++++++---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 18 ++++++++++++++++++
 arch/x86/kvm/x86.c              |  2 ++
 4 files changed, 29 insertions(+), 4 deletions(-)
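
For context, below is a minimal userspace sketch (not part of the series) of
the ioctl ordering the new warning targets.  It assumes a bare VM with no
guest memory or register setup, elides all error handling, caps the CPUID
array at an arbitrary 256 entries, and assumes the first KVM_RUN reaches at
least one VM-entry attempt so that last_vmentry_cpu is populated:

/*
 * Illustrative only: calls KVM_SET_CPUID2 both before and after KVM_RUN.
 * The second call is the case that now logs "KVM_SET_CPUID{,2} after
 * KVM_RUN may cause guest instability".
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm  = open("/dev/kvm", O_RDWR);
	int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

	/* 256 entries is an assumption; KVM returns E2BIG if it is too few. */
	struct kvm_cpuid2 *cpuid = calloc(1, sizeof(*cpuid) +
					  256 * sizeof(struct kvm_cpuid_entry2));
	cpuid->nent = 256;
	ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid);

	/* Setting CPUID before the first KVM_RUN is the supported flow. */
	ioctl(vcpu, KVM_SET_CPUID2, cpuid);

	ioctl(vcpu, KVM_RUN, 0);	/* guest memory/register setup omitted */

	/* After an entry has been attempted, last_vmentry_cpu != -1: warns. */
	ioctl(vcpu, KVM_SET_CPUID2, cpuid);

	free(cpuid);
	return 0;
}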

Comments

Paolo Bonzini June 23, 2021, 2:16 p.m. UTC | #1
On 22/06/21 19:56, Sean Christopherson wrote:
> +	/*
> +	 * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
> +	 * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't
> +	 * tracked in kvm_mmu_page_role.  As a result, KVM may miss guest page
> +	 * faults due to reusing SPs/SPTEs.  Alert userspace, but otherwise
> +	 * sweep the problem under the rug.
> +	 *
> +	 * KVM's horrific CPUID ABI makes the problem all but impossible to
> +	 * solve, as correctly handling multiple vCPU models (with respect to
> +	 * paging and physical address properties) in a single VM would require
> +	 * tracking all relevant CPUID information in kvm_mmu_page_role.  That
> +	 * is very undesirable as it would double the memory requirements for
> +	 * gfn_track (see struct kvm_mmu_page_role comments), and in practice
> +	 * no sane VMM mucks with the core vCPU model on the fly.
> +	 */
> +	if (vcpu->arch.last_vmentry_cpu != -1)
> +		pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");

Let's make this even stronger and promise to break it in 5.16.

Paolo
Jim Mattson June 23, 2021, 5 p.m. UTC | #2
On Wed, Jun 23, 2021 at 7:16 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 22/06/21 19:56, Sean Christopherson wrote:
> > +     /*
> > +      * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
> > +      * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't
> > +      * tracked in kvm_mmu_page_role.  As a result, KVM may miss guest page
> > +      * faults due to reusing SPs/SPTEs.  Alert userspace, but otherwise
> > +      * sweep the problem under the rug.
> > +      *
> > +      * KVM's horrific CPUID ABI makes the problem all but impossible to
> > +      * solve, as correctly handling multiple vCPU models (with respect to
> > +      * paging and physical address properties) in a single VM would require
> > +      * tracking all relevant CPUID information in kvm_mmu_page_role.  That
> > +      * is very undesirable as it would double the memory requirements for
> > +      * gfn_track (see struct kvm_mmu_page_role comments), and in practice
> > +      * no sane VMM mucks with the core vCPU model on the fly.
> > +      */
> > +     if (vcpu->arch.last_vmentry_cpu != -1)
> > +             pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
>
> Let's make this even stronger and promise to break it in 5.16.
>
> Paolo

Doesn't this fall squarely into kvm's philosophy of "we should let
userspace shoot itself in the foot wherever possible"? I thought we
only stepped in when host stability was an issue.

I'm actually delighted if this is a sign that we're rethinking that
philosophy. I'd just like to hear someone say it.
Paolo Bonzini June 23, 2021, 5:11 p.m. UTC | #3
On 23/06/21 19:00, Jim Mattson wrote:
> On Wed, Jun 23, 2021 at 7:16 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 22/06/21 19:56, Sean Christopherson wrote:
>>> +     /*
>>> +      * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
>>> +      * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't
>>> +      * tracked in kvm_mmu_page_role.  As a result, KVM may miss guest page
>>> +      * faults due to reusing SPs/SPTEs.  Alert userspace, but otherwise
>>> +      * sweep the problem under the rug.
>>> +      *
>>> +      * KVM's horrific CPUID ABI makes the problem all but impossible to
>>> +      * solve, as correctly handling multiple vCPU models (with respect to
>>> +      * paging and physical address properties) in a single VM would require
>>> +      * tracking all relevant CPUID information in kvm_mmu_page_role.  That
>>> +      * is very undesirable as it would double the memory requirements for
>>> +      * gfn_track (see struct kvm_mmu_page_role comments), and in practice
>>> +      * no sane VMM mucks with the core vCPU model on the fly.
>>> +      */
>>> +     if (vcpu->arch.last_vmentry_cpu != -1)
>>> +             pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
>>
>> Let's make this even stronger and promise to break it in 5.16.
>>
>> Paolo
> 
> Doesn't this fall squarely into kvm's philosophy of "we should let
> userspace shoot itself in the foot wherever possible"? I thought we
> only stepped in when host stability was an issue.
> 
> I'm actually delighted if this is a sign that we're rethinking that
> philosophy. I'd just like to hear someone say it.

Nah, that's not the philosophy.  The philosophy is that covering all 
possible ways for userspace to shoot itself in the foot is impossible.

However, here we're talking about 2 lines of code (thanks also to your 
patches that add last_vmentry_cpu for completely unrelated reasons) to 
remove a whole set of bullet/foot encounters.

Paolo
Jim Mattson June 23, 2021, 6:11 p.m. UTC | #4
On Wed, Jun 23, 2021 at 10:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 23/06/21 19:00, Jim Mattson wrote:
> > On Wed, Jun 23, 2021 at 7:16 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >> On 22/06/21 19:56, Sean Christopherson wrote:
> >>> +     /*
> >>> +      * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
> >>> +      * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't
> >>> +      * tracked in kvm_mmu_page_role.  As a result, KVM may miss guest page
> >>> +      * faults due to reusing SPs/SPTEs.  Alert userspace, but otherwise
> >>> +      * sweep the problem under the rug.
> >>> +      *
> >>> +      * KVM's horrific CPUID ABI makes the problem all but impossible to
> >>> +      * solve, as correctly handling multiple vCPU models (with respect to
> >>> +      * paging and physical address properties) in a single VM would require
> >>> +      * tracking all relevant CPUID information in kvm_mmu_page_role.  That
> >>> +      * is very undesirable as it would double the memory requirements for
> >>> +      * gfn_track (see struct kvm_mmu_page_role comments), and in practice
> >>> +      * no sane VMM mucks with the core vCPU model on the fly.
> >>> +      */
> >>> +     if (vcpu->arch.last_vmentry_cpu != -1)
> >>> +             pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
> >>
> >> Let's make this even stronger and promise to break it in 5.16.
> >>
> >> Paolo
> >
> > Doesn't this fall squarely into kvm's philosophy of "we should let
> > userspace shoot itself in the foot wherever possible"? I thought we
> > only stepped in when host stability was an issue.
> >
> > I'm actually delighted if this is a sign that we're rethinking that
> > philosophy. I'd just like to hear someone say it.
>
> Nah, that's not the philosophy.  The philosophy is that covering all
> possible ways for userspace to shoot itself in the foot is impossible.
>
> However, here we're talking about 2 lines of code (thanks also to your
> patches that add last_vmentry_cpu for completely unrelated reasons) to
> remove a whole set of bullet/foot encounters.

What about the problems that arise when we have different CPUID tables
for different vCPUs in the same VM? Can we just replace this
hole-in-foot inducing ioctl with a KVM_VM_SET_CPUID ioctl on the VM
level that has to be called before any vCPUs are created?
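
[For illustration only: KVM_VM_SET_CPUID is the hypothetical VM-scoped ioctl
Jim proposes above, not an existing KVM interface.  A sketch of the flow:

	/* Hypothetical: set the CPUID model once, VM-wide, before any vCPUs. */
	ioctl(vm_fd, KVM_VM_SET_CPUID, cpuid);
	int vcpu0 = ioctl(vm_fd, KVM_CREATE_VCPU, 0); /* inherits VM-wide CPUID */
	int vcpu1 = ioctl(vm_fd, KVM_CREATE_VCPU, 1); /* no per-vCPU KVM_SET_CPUID2 */
]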
Paolo Bonzini June 23, 2021, 6:49 p.m. UTC | #5
On 23/06/21 20:11, Jim Mattson wrote:
> On Wed, Jun 23, 2021 at 10:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Nah, that's not the philosophy.  The philosophy is that covering all
>> possible ways for userspace to shoot itself in the foot is impossible.
>>
>> However, here we're talking about 2 lines of code (thanks also to your
>> patches that add last_vmentry_cpu for completely unrelated reasons) to
>> remove a whole set of bullet/foot encounters.
> 
> What about the problems that arise when we have different CPUID tables
> for different vCPUs in the same VM? Can we just replace this
> hole-in-foot inducing ioctl with a KVM_VM_SET_CPUID ioctl on the VM
> level that has to be called before any vCPUs are created?

Are there any KVM bugs that this can fix?  The problem is that, unlike 
this case, it would be effectively impossible to deprecate 
KVM_SET_CPUID2 as a vcpu ioctl, so it would be hard to reap any benefits 
in KVM.

BTW, there is actually a theoretical usecase for KVM_SET_CPUID2 after 
KVM_RUN, which is to test OSes against microcode updates that hide, 
totally random example, the RTM bit.  But it's still not worth keeping 
it given 1) the bugs and complications in KVM, 2) if you really wanted 
that kind of testing so hard, the fact that you can just create a new 
vcpu file descriptor from scratch, possibly in cooperation with 
userspace MSR filtering 3) AFAIK no one has done that anyway in 15 years.

Paolo
Jim Mattson June 23, 2021, 7:02 p.m. UTC | #6
On Wed, Jun 23, 2021 at 11:49 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 23/06/21 20:11, Jim Mattson wrote:
> > On Wed, Jun 23, 2021 at 10:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> Nah, that's not the philosophy.  The philosophy is that covering all
> >> possible ways for userspace to shoot itself in the foot is impossible.
> >>
> >> However, here we're talking about 2 lines of code (thanks also to your
> >> patches that add last_vmentry_cpu for completely unrelated reasons) to
> >> remove a whole set of bullet/foot encounters.
> >
> > What about the problems that arise when we have different CPUID tables
> > for different vCPUs in the same VM? Can we just replace this
> > hole-in-foot inducing ioctl with a KVM_VM_SET_CPUID ioctl on the VM
> > level that has to be called before any vCPUs are created?
>
> Are there any KVM bugs that this can fix?  The problem is that, unlike
> this case, it would be effectively impossible to deprecate
> KVM_SET_CPUID2 as a vcpu ioctl, so it would be hard to reap any benefits
> in KVM.
>
> BTW, there is actually a theoretical usecase for KVM_SET_CPUID2 after
> KVM_RUN, which is to test OSes against microcode updates that hide,
> totally random example, the RTM bit.  But it's still not worth keeping
> it given 1) the bugs and complications in KVM, 2) if you really wanted
> that kind of testing so hard, the fact that you can just create a new
> vcpu file descriptor from scratch, possibly in cooperation with
> userspace MSR filtering 3) AFAIK no one has done that anyway in 15 years.

Though such a usecase may exist, I don't think it actually works
today. For example, kvm_vcpu_after_set_cpuid() potentially changes the
value of the guest IA32_PERF_GLOBAL_CTRL MSR.
Paolo Bonzini June 23, 2021, 7:53 p.m. UTC | #7
On 23/06/21 21:02, Jim Mattson wrote:
>>
>> BTW, there is actually a theoretical usecase for KVM_SET_CPUID2 after
>> KVM_RUN, which is to test OSes against microcode updates that hide,
>> totally random example, the RTM bit.  But it's still not worth keeping
>> it given 1) the bugs and complications in KVM, 2) if you really wanted
>> that kind of testing so hard, the fact that you can just create a new
>> vcpu file descriptor from scratch, possibly in cooperation with
>> userspace MSR filtering 3) AFAIK no one has done that anyway in 15 years.
>
> Though such a usecase may exist, I don't think it actually works
> today. For example, kvm_vcpu_after_set_cpuid() potentially changes the
> value of the guest IA32_PERF_GLOBAL_CTRL MSR.

Yep, and that's why I'm okay with actively deprecating KVM_SET_CPUID2 
and not just "discouraging" it.

Paolo

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e328caa35d6c..06e82f07fe54 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -688,9 +688,14 @@  MSRs that have been set successfully.
 Defines the vcpu responses to the cpuid instruction.  Applications
 should use the KVM_SET_CPUID2 ioctl if available.
 
-Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
-configuration (if there is) is not corrupted. Userspace can get a copy of the
-resulting CPUID configuration through KVM_GET_CPUID2 in case.
+Caveat emptor:
+  - If this IOCTL fails, KVM gives no guarantees that previous valid CPUID
+    configuration (if there is) is not corrupted. Userspace can get a copy
+    of the resulting CPUID configuration through KVM_GET_CPUID2 in case.
+  - Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
+    after running the guest, may cause guest instability.
+  - Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
+    may cause guest instability.
 
 ::
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4ac534766eff..19c88b445ee0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -840,7 +840,7 @@  struct kvm_vcpu_arch {
 	bool l1tf_flush_l1d;
 
 	/* Host CPU on which VM-entry was most recently attempted */
-	unsigned int last_vmentry_cpu;
+	int last_vmentry_cpu;
 
 	/* AMD MSRC001_0015 Hardware Configuration */
 	u64 msr_hwcr;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2668a9b5936..8d97d21d5241 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4913,6 +4913,24 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
 	kvm_mmu_reset_context(vcpu);
+
+	/*
+	 * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
+	 * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't
+	 * tracked in kvm_mmu_page_role.  As a result, KVM may miss guest page
+	 * faults due to reusing SPs/SPTEs.  Alert userspace, but otherwise
+	 * sweep the problem under the rug.
+	 *
+	 * KVM's horrific CPUID ABI makes the problem all but impossible to
+	 * solve, as correctly handling multiple vCPU models (with respect to
+	 * paging and physical address properties) in a single VM would require
+	 * tracking all relevant CPUID information in kvm_mmu_page_role.  That
+	 * is very undesirable as it would double the memory requirements for
+	 * gfn_track (see struct kvm_mmu_page_role comments), and in practice
+	 * no sane VMM mucks with the core vCPU model on the fly.
+	 */
+	if (vcpu->arch.last_vmentry_cpu != -1)
+		pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
 }
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42608b515ce4..92b4a9305651 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10583,6 +10583,8 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	struct page *page;
 	int r;
 
+	vcpu->arch.last_vmentry_cpu = -1;
+
 	if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 	else