diff mbox series

KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit

Message ID 20250116035008.43404-1-yosryahmed@google.com (mailing list archive)
State New
Headers show
Series KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit | expand

Commit Message

Yosry Ahmed Jan. 16, 2025, 3:50 a.m. UTC
nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
flush the TLB if VPID is enabled for both L1 and L2, but they still
share the TLB tag. This happens if EPT is disabled and KVM fails to
allocate a VPID for L2, so both the EPTP and VPID are shared between L1
and L2.

Interestingly, nested_vmx_transition_tlb_flush() uses
KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
flush is required.

Taking a close look at vmx_flush_tlb_guest() and
vmx_flush_tlb_current(), the main differences are:
(a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
(b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
INVVPID) to flush the guest-physical mappings as well as combined
mappings.

The check in (a) is seemingly an optimization, and there should not be
any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
check in vmx_flush_tlb_guest() is not a fundamental difference, and it
can be added there separately if needed.

The difference in (b) is irrelevant in this case, because EPT being
enabled for L1 means that its TLB tags are tagged with EPTP and cannot
be used by L2 (regardless of whether or not EPT is enabled for L2).

Use KVM_REQ_TLB_FLUSH_GUEST in this case in
nested_vmx_transition_tlb_flush() for consistency. This arguably makes
more sense conceptually too -- L1 and L2 cannot share the TLB tag for
guest-physical translations, so only flushing linear and combined
translations (i.e. guest-generated translations) is needed.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---

I tested this by running all selftests that have "nested" in their name
(and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
convinced myself it's prompted by the change :)

---
 arch/x86/kvm/vmx/nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jim Mattson Jan. 16, 2025, 5:27 a.m. UTC | #1
On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> flush the TLB if VPID is enabled for both L1 and L2, but they still
> share the TLB tag. This happens if EPT is disabled and KVM fails to
> allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> and L2.

Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
acronym for EP4TA), not EPTP. But, in any case, with EPT disabled,
there are no combined or guest-physical mappings. There are only
linear mappings.

> Interestingly, nested_vmx_transition_tlb_flush() uses
> KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> flush is required.
>
> Taking a close look at vmx_flush_tlb_guest() and
> vmx_flush_tlb_current(), the main differences are:
> (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> INVVPID) to flush the guest-physical mappings as well as combined
> mappings.
>
> The check in (a) is seemingly an optimization, and there should not be
> any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> can be added there separately if needed.
>
> The difference in (b) is irrelevant in this case, because EPT being
> enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> be used by L2 (regardless of whether or not EPT is enabled for L2).

The difference is also irrelevant because, as you concluded in the
first paragraph, EPT is disabled in the final block of
nested_vmx_transition_tlb_flush().

> Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> guest-physical translations, so only flushing linear and combined
> translations (i.e. guest-generated translations) is needed.

And, as I mentioned above, with EPT disabled, there are no combined or
guest-physical mappings.

> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

I think the reasoning in the commit message can be cleared up a bit, but...

Reviewed-by: Jim Mattson <mattson@google.com>

> ---
>
> I tested this by running all selftests that have "nested" in their name
> (and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
> convinced myself it's prompted by the change :)
>
> ---
>  arch/x86/kvm/vmx/nested.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index aa78b6f38dfef..2ed454186e59c 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1241,7 +1241,7 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>          * as the effective ASID is common to both L1 and L2.
>          */
>         if (!nested_has_guest_tlb_tag(vcpu))
> -               kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> +               kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  }
>
>  static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
> --
> 2.48.0.rc2.279.g1de40edade-goog
>
>
Yosry Ahmed Jan. 16, 2025, 3:25 p.m. UTC | #2
On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> > flush the TLB if VPID is enabled for both L1 and L2, but they still
> > share the TLB tag. This happens if EPT is disabled and KVM fails to
> > allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> > and L2.
>
> Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
> acronym for EP4TA), not EPTP. But, in any case, with EPT disabled,
> there are no combined or guest-physical mappings. There are only
> linear mappings.

Interestingly, I did initially write EPTRTA, but I changed it to EPTP
because that is the terminology used in nested_has_guest_tlb_tag().
Anyway, I definitely don't mind changing it to EPTRTA.

>
> > Interestingly, nested_vmx_transition_tlb_flush() uses
> > KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> > flush is required.
> >
> > Taking a close look at vmx_flush_tlb_guest() and
> > vmx_flush_tlb_current(), the main differences are:
> > (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> > (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> > INVVPID) to flush the guest-physical mappings as well as combined
> > mappings.
> >
> > The check in (a) is seemingly an optimization, and there should not be
> > any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> > check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> > can be added there separately if needed.
> >
> > The difference in (b) is irrelevant in this case, because EPT being
> > enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> > be used by L2 (regardless of whether or not EPT is enabled for L2).
>
> The difference is also irrelevant because, as you concluded in the
> first paragraph, EPT is disabled in the final block of
> nested_vmx_transition_tlb_flush().

I was trying to explain that even if EPT is enabled, sharing
guest-physical translations between L1 and L2 should never be possible
(and hence we should never worry about flushing these translations in
nested_vmx_transition_tlb_flush()).  Now that I read it again it is
not as clear as I had hoped.

>
> > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > guest-physical translations, so only flushing linear and combined
> > translations (i.e. guest-generated translations) is needed.
>
> And, as I mentioned above, with EPT disabled, there are no combined or
> guest-physical mappings.
>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> I think the reasoning in the commit message can be cleared up a bit, but...

Agreed :) I am sure Sean will also want changes in the commit message anyway.

>
> Reviewed-by: Jim Mattson <mattson@google.com>

Thanks for the quick review!
Sean Christopherson Jan. 16, 2025, 5:11 p.m. UTC | #3
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > guest-physical translations, so only flushing linear and combined
> > > translations (i.e. guest-generated translations) is needed.

No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
TLBs, and thus KVM is not required to FLUSH_GUEST.

E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
guaranteed to be visible to L2.

But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.

Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
is not required in this scenario.

  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
  {
	++vcpu->stat.tlb_flush;

	if (!tdp_enabled) {
		/*
		 * A TLB flush on behalf of the guest is equivalent to
		 * INVPCID(all), toggling CR4.PGE, etc., which requires
		 * a forced sync of the shadow page tables.  Ensure all the
		 * roots are synced and the guest TLB in hardware is clean.
		 */
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_sync_prev_roots(vcpu);
	}

	kvm_x86_call(flush_tlb_guest)(vcpu);

	/*
	 * Flushing all "guest" TLB is always a superset of Hyper-V's fine
	 * grained flushing.
	 */
	kvm_hv_vcpu_purge_flush_tlb(vcpu);
  }
Yosry Ahmed Jan. 16, 2025, 6:24 p.m. UTC | #4
On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > guest-physical translations, so only flushing linear and combined
> > > > translations (i.e. guest-generated translations) is needed.
>
> No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
> is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> TLBs, and thus KVM is not required to FLUSH_GUEST.
>
> E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> guaranteed to be visible to L2.
>
> But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
> L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
>
> Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> is not required in this scenario.

Aha, I was examining vmx_flush_tlb_guest() not
kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
think it's possible that we end up unnecessarily synchronizing the
shadow page tables (or dropping them) in this case.

Do you think it's worth expanding the comment in
nested_vmx_transition_tlb_flush()?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2ed454186e59c..43d34e413d016 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1239,6 +1239,11 @@ static void
nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
         * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
         * KVM was unable to allocate a VPID for L2, flush the current context
         * as the effective ASID is common to both L1 and L2.
+        *
+        * Note that even though TLB_FLUSH_GUEST would be correct because we
+        * only need to flush linear mappings, it would unnecessarily
+        * synchronize the MMU even though a TLB flush is not architecturally
+        * required from L1's perspective.
         */
        if (!nested_has_guest_tlb_tag(vcpu))
                kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index aa78b6f38dfef..2ed454186e59c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1241,7 +1241,7 @@  static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
 	 * as the effective ASID is common to both L1 and L2.
 	 */
 	if (!nested_has_guest_tlb_tag(vcpu))
-		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 
 static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)