
[06/43] KVM: x86: Properly reset MMU context at vCPU RESET/INIT

Message ID 20210424004645.3950558-7-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86: vCPU RESET/INIT fixes and consolidation

Commit Message

Sean Christopherson April 24, 2021, 12:46 a.m. UTC
Post-process the CR0 and CR4 changes at vCPU INIT (and RESET for good
measure) to effect a MMU context reset when necessary.  Simply
re-initializing the current MMU is not sufficient as the current root
HPA may not be usable in the new context.  E.g. if TDP is disabled and
INIT arrives while the vCPU is in long mode, KVM will fail to switch to
the 32-bit pae_root and bomb on the next VM-Enter due to running with a
64-bit CR3 in 32-bit mode.

This bug was papered over in both VMX and SVM.

In VMX, the INIT issue is specific to running without unrestricted guest
since unrestricted guest is available if and only if EPT is enabled.
Commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real
mode") resolved the issue by forcing a reset when entering emulated real
mode.

In SVM, commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset") forced
a MMU reset on every INIT to work around the flaw in common x86.  Note, at
the time the bug was fixed, the SVM problem was exacerbated by a complete
lack of a CR4 update.

The VMX and SVM fixes are not technically wrong, but lack of precision
makes it difficult to reason about why a context reset is needed.  The VMX
code in particular is nasty. The vendor resets will be reverted in future
patches, primarily to aid bisection in case there are non-INIT flows that
rely on the existing VMX logic.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Reiji Watanabe May 17, 2021, 4:57 p.m. UTC | #1
>  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  {
> +       unsigned long old_cr0 = kvm_read_cr0(vcpu);
> +       unsigned long old_cr4 = kvm_read_cr4(vcpu);
> +
>         kvm_lapic_reset(vcpu, init_event);
>
>         vcpu->arch.hflags = 0;
> @@ -10483,6 +10485,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         vcpu->arch.ia32_xss = 0;
>
>         static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
> +
> +       if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> +           kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> +               kvm_mmu_reset_context(vcpu);
>  }

I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
for a change in EFER.NX as well.

Thanks,
Reiji
Sean Christopherson May 18, 2021, 8:23 p.m. UTC | #2
On Mon, May 17, 2021, Reiji Watanabe wrote:
> >  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  {
> > +       unsigned long old_cr0 = kvm_read_cr0(vcpu);
> > +       unsigned long old_cr4 = kvm_read_cr4(vcpu);
> > +
> >         kvm_lapic_reset(vcpu, init_event);
> >
> >         vcpu->arch.hflags = 0;
> > @@ -10483,6 +10485,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >         vcpu->arch.ia32_xss = 0;
> >
> >         static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
> > +
> > +       if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > +           kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > +               kvm_mmu_reset_context(vcpu);
> >  }
> 
> I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> for a change in EFER.NX as well.

Oooh.  So there _should_ be no need.   Paging has to be enabled for EFER.NX to
be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
guaranteed to trigger a context reset.  And we do want to skip the context reset,
e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
same MMU.

But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
However, that's benign from a functionality perspective because the context
itself correctly incorporates CR0.PG, it's only the role that's borked.  I.e.
KVM will fail to reuse a page/context due to the spurious role.nxe, but the
permission checks are always correct.

I'll add a comment here and send a patch to fix the role calculation.
Reiji Watanabe May 18, 2021, 10:42 p.m. UTC | #3
> > > +       if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > +           kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > +               kvm_mmu_reset_context(vcpu);
> > >  }
> >
> > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > for a change in EFER.NX as well.
>
> Oooh.  So there _should_ be no need.   Paging has to be enabled for EFER.NX to
> be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> guaranteed to trigger a context reset.  And we do want to skip the context reset,
> e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> same MMU.
>
> But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> However, that's benign from a functionality perspective because the context
> itself correctly incorporates CR0.PG, it's only the role that's borked.  I.e.
> KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> permission checks are always correct.
>
> I'll add a comment here and send a patch to fix the role calculation.

Thank you so much for the explanation !
I understand your intention and why it would be benign.

Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
called here.  Looking at the Intel SDM, in my understanding,
all the bits kvm_cr4_mmu_role_changed() checks are relevant
only if paging is enabled.  (Or is my understanding incorrect ??)

Thanks,
Reiji
Sean Christopherson May 19, 2021, 5:16 p.m. UTC | #4
On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > > +       if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > > +           kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > > +               kvm_mmu_reset_context(vcpu);
> > > >  }
> > >
> > > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > > for a change in EFER.NX as well.
> >
> > Oooh.  So there _should_ be no need.   Paging has to be enabled for EFER.NX to
> > be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> > guaranteed to trigger a context reset.  And we do want to skip the context reset,
> > e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> > same MMU.
> >
> > But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> > MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> > However, that's benign from a functionality perspective because the context
> > itself correctly incorporates CR0.PG, it's only the role that's borked.  I.e.
> > KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> > permission checks are always correct.
> >
> > I'll add a comment here and send a patch to fix the role calculation.
> 
> Thank you so much for the explanation !
> I understand your intention and why it would be benign.
> 
> Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
> called here.  Looking at the Intel SDM, in my understanding,
> all the bits kvm_cr4_mmu_role_changed() checks are relevant
> only if paging is enabled.  (Or is my understanding incorrect ??)

Duh, yes.  And it goes even beyond that, CR0.WP is only relevant if CR0.PG=1,
i.e. INIT with CR0.PG=0 and CR0.WP=1 will incorrectly trigger a MMU reset with
the current logic.

Sadly, simply omitting the CR4 check puts us in an awkward situation where, due
to the MMU role CR4 calculations not accounting for CR0.PG=0, KVM will run with
a stale role.

The other consideration is that kvm_post_set_cr4() and kvm_post_set_cr0() should
also skip kvm_mmu_reset_context() if CR0.PG=0, but again that requires fixing
the role calculations first (or at the same time).

I think I'll throw in those cleanups to the beginning of this series.  The result
is going to be disgustingly long, but I really don't want to introduce code that
knowingly leaves KVM in an inconsistent state, nor do I want to add useless
checks on CR4 and EFER.
Reiji Watanabe May 24, 2021, 4:57 a.m. UTC | #5
On Wed, May 19, 2021 at 10:16 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > > > +       if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > > > +           kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > > > +               kvm_mmu_reset_context(vcpu);
> > > > >  }
> > > >
> > > > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > > > for a change in EFER.NX as well.
> > >
> > > Oooh.  So there _should_ be no need.   Paging has to be enabled for EFER.NX to
> > > be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> > > guaranteed to trigger a context reset.  And we do want to skip the context reset,
> > > e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> > > same MMU.
> > >
> > > But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> > > MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> > > However, that's benign from a functionality perspective because the context
> > > itself correctly incorporates CR0.PG, it's only the role that's borked.  I.e.
> > > KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> > > permission checks are always correct.
> > >
> > > I'll add a comment here and send a patch to fix the role calculation.
> >
> > Thank you so much for the explanation !
> > I understand your intention and why it would be benign.
> >
> > Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
> > called here.  Looking at the Intel SDM, in my understanding,
> > all the bits kvm_cr4_mmu_role_changed() checks are relevant
> > only if paging is enabled.  (Or is my understanding incorrect ??)
>
> Duh, yes.  And it goes even beyond that, CR0.WP is only relevant if CR0.PG=1,
> i.e. INIT with CR0.PG=0 and CR0.WP=1 will incorrectly trigger a MMU reset with
> the current logic.
>
> Sadly, simply omitting the CR4 check puts us in an awkward situation where, due
> to the MMU role CR4 calculations not accounting for CR0.PG=0, KVM will run with
> a stale role.
>
> The other consideration is that kvm_post_set_cr4() and kvm_post_set_cr0() should
> also skip kvm_mmu_reset_context() if CR0.PG=0, but again that requires fixing
> the role calculations first (or at the same time).
>
> I think I'll throw in those cleanups to the beginning of this series.  The result
> is going to be disgustingly long, but I really don't want to introduce code that
> knowingly leaves KVM in an inconsistent state, nor do I want to add useless
> checks on CR4 and EFER.

Yes, I would think having the cleanups would be better.

Thank you !
Reiji

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0bc783fc6c9b..b87193190a73 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10341,7 +10341,6 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	kvm_vcpu_mtrr_init(vcpu);
 	vcpu_load(vcpu);
 	kvm_vcpu_reset(vcpu, false);
-	kvm_init_mmu(vcpu, false);
 	vcpu_put(vcpu);
 	return 0;
 
@@ -10415,6 +10414,9 @@  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
+	unsigned long old_cr0 = kvm_read_cr0(vcpu);
+	unsigned long old_cr4 = kvm_read_cr4(vcpu);
+
 	kvm_lapic_reset(vcpu, init_event);
 
 	vcpu->arch.hflags = 0;
@@ -10483,6 +10485,10 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->arch.ia32_xss = 0;
 
 	static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
+
+	if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
+	    kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
+		kvm_mmu_reset_context(vcpu);
 }
 
 void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)