
[v3,01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

Message ID 20200320212833.3507-2-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series KVM: x86: TLB flushing fixes and enhancements

Commit Message

Sean Christopherson March 20, 2020, 9:27 p.m. UTC
Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH.  Remote TLB
flushes require all contexts to be invalidated, not just the active
contexts, e.g. all mappings in all contexts for a given HVA need to be
invalidated on an mmu_notifier invalidation.  Similarly, the instigator
of the deferred TLB flush may be expecting all contexts to be flushed,
e.g. vmx_vcpu_load_vmcs().
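
For reference, "triggered by KVM_REQ_TLB_FLUSH" means the generic code has
already broadcast the request to every vCPU, so the responder has no idea
which context the instigator cared about.  A simplified sketch of the
remote-flush path, abridged from virt/kvm/kvm_main.c of this era (the arch
hook and some barriers are elided):

void kvm_flush_remote_tlbs(struct kvm *kvm)
{
	/*
	 * Read tlbs_dirty before broadcasting the request so that a
	 * flush pended by an in-flight shadow page table update is
	 * not lost.
	 */
	long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);

	/* Kick all vCPUs; each flushes before re-entering its guest. */
	if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
		++kvm->stat.remote_tlb_flush;

	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
}

Because the request carries no context information, the only safe response
on the VMX side is a full flush.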

Without nested VMX, flushing only the current EPTP/VPID context isn't
problematic because KVM uses a constant VPID for each vCPU, and
mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP
for L1.  In the rare case where a different EPTP is created or reused,
KVM (currently) unconditionally flushes the new EPTP context prior to
entering the guest.
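
The "constant VPID" is claimed at vCPU creation and held for the vCPU's
lifetime; a sketch of the allocator, abridged from arch/x86/kvm/vmx/vmx.c
of this era:

static int allocate_vpid(void)
{
	int vpid;

	if (!enable_vpid)
		return 0;

	spin_lock(&vmx_vpid_lock);
	vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
	if (vpid < VMX_NR_VPIDS)
		__set_bit(vpid, vmx_vpid_bitmap);
	else
		vpid = 0;	/* VPID 0 == flush on every VM-Enter */
	spin_unlock(&vmx_vpid_lock);

	return vpid;
}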

With nested VMX, KVM conditionally uses a different VPID for L2, and
unconditionally uses a different EPTP for L2.  Because KVM doesn't
_intentionally_ guarantee L2's EPTP/VPID context is flushed on nested
VM-Enter, it'd be possible for a malicious L1 to attack the host and/or
different VMs by exploiting the lack of flushing for L2.

  1) Launch nested guest from malicious L1.

  2) Nested VM-Enter to L2.

  3) Access target GPA 'g'.  CPU inserts TLB entry tagged with L2's ASID
     mapping 'g' to host PFN 'x'.

  4) Nested VM-Exit to L1.

  5) L1 triggers kernel samepage merging (KSM) by duplicating/zeroing
     the page for PFN 'x'.

  6) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and
     remaps the page to PFN 'y'.  mmu_notifier sends invalidate command,
     KVM flushes TLB only for L1's ASID.

  7) Host kernel reallocates PFN 'x' to some other task/guest.

  8) Nested VM-Enter to L2.  KVM does not invalidate L2's EPTP or VPID.

  9) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its
     stale TLB entry.

However, current KVM unconditionally flushes L1's EPTP/VPID context on
nested VM-Exit.  But that behavior is mostly unintentional; KVM doesn't
go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit, rather
a TLB flush is guaranteed to occur prior to re-entering L1 due to
__kvm_mmu_new_cr3() always being called with skip_tlb_flush=false.  On
nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT
enabled) or in nested_vmx_load_cr3() (nested EPT disabled).  On nested
VM-Exit it occurs via nested_vmx_load_cr3().
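
A sketch of that guarantee, abridged from the MMU code of this era (the
helpers have since been reworked into kvm_mmu_new_pgd() and friends, so
treat this as illustrative rather than current):

static void __kvm_mmu_new_cr3(struct kvm_vcpu *vcpu, gpa_t new_cr3,
			      union kvm_mmu_page_role new_role,
			      bool skip_tlb_flush)
{
	/*
	 * With skip_tlb_flush == false, the fast-switch path raises
	 * KVM_REQ_TLB_FLUSH; if the fast switch fails, the current
	 * root is freed and its replacement is flushed when loaded.
	 */
	if (!fast_cr3_switch(vcpu, new_cr3, new_role, skip_tlb_flush))
		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu,
				   KVM_MMU_ROOT_CURRENT);
}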

This also fixes a bug where a deferred TLB flush in the context of L2,
with EPT disabled, would flush L1's VPID instead of L2's VPID, as
vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode().
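
For reference, the pre-patch helpers (reconstructed from vmx.h of this
era) show that bug directly: the deferred-flush path always passes L1's
vpid, ignoring nested.vpid02.

static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
				   bool invalidate_gpa)
{
	if (enable_ept && (invalidate_gpa || !enable_vpid)) {
		if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
			return;
		ept_sync_context(construct_eptp(vcpu,
						vcpu->arch.mmu->root_hpa));
	} else {
		/* With EPT disabled, only @vpid is flushed ... */
		vpid_sync_context(vpid);
	}
}

static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
{
	/* ... and @vpid is always L1's, even with L2 active. */
	__vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid, invalidate_gpa);
}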

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Junaid Shahid <junaids@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: John Haxby <john.haxby@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Fixes: efebf0aaec3d ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/vmx.h | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

Comments

Lai Jiangshan Aug. 3, 2021, 1:45 a.m. UTC | #1
(I'm replying to a very old email, so many CCs are dropped.)

On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
> a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH.  Remote TLB
> flushes require all contexts to be invalidated, not just the active
> contexts, e.g. all mappings in all contexts for a given HVA need to be
> invalidated on an mmu_notifier invalidation.  Similarly, the instigator
> of the deferred TLB flush may be expecting all contexts to be flushed,
> e.g. vmx_vcpu_load_vmcs().
>
> Without nested VMX, flushing only the current EPTP/VPID context isn't
> problematic because KVM uses a constant VPID for each vCPU, and

Hello, Sean

Is the patch optimized for cases where nested VMX is active?
I think the non-nested cases are the normal cases.

Although the related code has been changed, the logic of the patch
still applies; would it be better to restore the optimization for the
normal (non-nested) cases?

Thanks
Lai

[...]
Sean Christopherson Aug. 3, 2021, 3:39 p.m. UTC | #2
On Tue, Aug 03, 2021, Lai Jiangshan wrote:
> (I'm replying to a very old email, so many CCs are dropped.)
> 
> On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > [...]
> 
> Hello, Sean
> 
> Is the patch optimized for cases where nested VMX is active?

Well, this patch isn't, but KVM has since been optimized to do full EPT/VPID
flushes only when "necessary".  Necessary in quotes because the two uses can
technically be further optimized, but doing so would incur significant complexity.

Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
but KVM would need to propagate more information (mmu_role?) in order for responding
vCPUs to determine which contexts need to be flushed.  And practically speaking,
for MMU flushes there's no meaningful difference when using TDP without nested
guests as the common case will be that each vCPU has a single active EPTP and
that EPTP will be affected by the MMU changes, i.e. needs to be flushed.

Use #2 is in VMX's pCPU migration path.  Again, not strictly necessary as KVM could
theoretically track which pCPUs have run a particular vCPU and when that pCPU last
flushed EPT contexts, but fully solving the problem would be quite complex.  Since
pCPU migration is always going to be a slow path, the extra complexity would be
very difficult to justify.
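
For the curious, a sketch of use #2, abridged from vmx_vcpu_load_vmcs()
(the VMCS and descriptor-table bookkeeping is elided):

void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	bool already_loaded = vmx->loaded_vmcs->cpu == cpu;

	if (!already_loaded) {
		/* ... clear/load the VMCS on the new pCPU ... */

		/*
		 * Flush all EPTP/VPID contexts; the new pCPU may have
		 * stale TLB entries from this vCPU's last run on it.
		 */
		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
	}
}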

> I think the non-nested cases are the normal cases.
> 
> Although the related code has been changed, the logic of the patch
> still applies; would it be better to restore the optimization for the
> normal (non-nested) cases?

As above, vmx_flush_tlb_all() hasn't changed, but the callers have.
Lai Jiangshan Aug. 4, 2021, 3:11 a.m. UTC | #3
On Tue, Aug 3, 2021 at 11:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Aug 03, 2021, Lai Jiangshan wrote:
> > (I'm replying to a very old email, so many CCs are dropped.)
> >
> > On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > [...]
> >
> > Hello, Sean
> >
> > Is the patch optimized for cases where nested VMX is active?
>
> Well, this patch isn't, but KVM has since been optimized to do full EPT/VPID
> flushes only when "necessary".  Necessary in quotes because the two uses can
> technically be further optimized, but doing so would incur significant complexity.

Hello, thanks for your reply.

I know there might be a lot of possible optimizations to be considered, many of
which are too complicated to be implemented.

The optimization I considered yesterday is "ept_sync_global() vs.
ept_sync_context(this_vcpu's)" in the case where the VM is using EPT and
doesn't allow nested VMs.  (I failed to express this clearly yesterday.)

In this case, the vCPU uses only a single root_hpa, and I think an EPT
sync of that single context is enough for both cases you listed below.

Once that context is flushed, the vCPU's TLB is clean.

If KVM changes mmu->root_hpa, it is KVM's responsibility to request
another flush, which is already implemented.

In other words, KVM_REQ_TLB_FLUSH == KVM_REQ_TLB_FLUSH_CURRENT in this case.
And before this patch, KVM flushed only the single context rather than
doing a global flush.
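
Roughly, the single-context flush Lai is describing is what later became
vmx_flush_tlb_current(); a simplified sketch, noting that
construct_eptp()'s signature varies by kernel version and the nested
(vpid02) case is elided:

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	u64 root_hpa = vcpu->arch.mmu->root_hpa;

	/* No flush required if the current context is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	if (enable_ept)
		ept_sync_context(construct_eptp(vcpu, root_hpa));
	else
		vpid_sync_context(to_vmx(vcpu)->vpid);
}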

>
> Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
> but KVM would need to propagate more information (mmu_role?) in order for responding
> vCPUs to determine which contexts need to be flushed.  And practically speaking,
> for MMU flushes there's no meaningful difference when using TDP without nested
> guests as the common case will be that each vCPU has a single active EPTP and
> that EPTP will be affected by the MMU changes, i.e. needs to be flushed.

I don't see when we would need "to determine which contexts" since the
vCPU is using only one context in this case (which is the assumption in
my mind); please correct me if I'm wrong.

Thanks,
Lai.

Sean Christopherson Aug. 4, 2021, 3:33 p.m. UTC | #4
On Wed, Aug 04, 2021, Lai Jiangshan wrote:
> The optimization I considered yesterday is "ept_sync_global() vs.
> ept_sync_context(this_vcpu's)" in the case where the VM is using EPT and
> doesn't allow nested VMs.  (I failed to express this clearly yesterday.)
> 
> In this case, the vCPU uses only a single root_hpa,

This is not strictly guaranteed.  kvm_mmu_page_role tracks efer.NX, cr0.wp, and
cr4.SMEP/SMAP (if cr0.wp=0), which means that KVM will create a different root
if the guest toggles any of those bits.  I'm pretty sure that can be changed and
will look into doing so in the near future[*], but even that wouldn't guarantee
a single root.

SMM is also incorporated in the page role and will result in different roots
for SMM vs. non-SMM.  This is mandatory because SMM has its own memslot view.

A CPUID.MAXPHYADDR change can also change the role, but in this case zapping all
roots will always be the correct/desired behavior.

[*] https://lkml.kernel.org/r/YQGj8gj7fpWDdLg5@google.com
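
For reference, an abridged sketch of kvm_mmu_page_role from this era
(field names have since been churned, e.g. nxe is now efer_nx; the
remaining fields are elided).  Any difference in these bits yields a
different root:

union kvm_mmu_page_role {
	u32 word;
	struct {
		unsigned level:4;
		unsigned direct:1;
		unsigned nxe:1;			/* efer.NX */
		unsigned cr0_wp:1;		/* cr0.WP */
		unsigned smep_andnot_wp:1;	/* cr4.SMEP && !cr0.WP */
		unsigned smap_andnot_wp:1;	/* cr4.SMAP && !cr0.WP */
		unsigned guest_mode:1;
		unsigned smm:8;			/* SMM memslot view */
		/* ... remaining fields elided ... */
	};
};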

> and I think an EPT sync of that single context is enough for both cases you listed below.
> 
> Once that context is flushed, the vCPU's TLB is clean.
> 
> If KVM changes mmu->root_hpa, it is KVM's responsibility to request
> another flush, which is already implemented.

KVM needs to flush when it allocates a new root, largely because it has no way
of knowing if some other entity previously created a CR3/EPTP at that HPA, but
KVM isn't strictly required to flush when switching to a previous/cached root.

Currently this is a moot point because kvm_post_set_cr0(), kvm_post_set_cr4(),
set_efer(), and kvm_smm_changed() all do kvm_mmu_reset_context() instead of
attempting a fast PGD switch, but I am hoping to change this as well, at least
for the non-SMM cases.

> In other words, KVM_REQ_TLB_FLUSH == KVM_REQ_TLB_FLUSH_CURRENT in this case.
> And before this patch, KVM flushed only the single context rather than
> doing a global flush.
> 
> >
> > Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
> > but KVM would need to propagate more information (mmu_role?) in order for responding
> > vCPUs to determine which contexts need to be flushed.  And practically speaking,
> > for MMU flushes there's no meaningful difference when using TDP without nested
> > guests as the common case will be that each vCPU has a single active EPTP and
> > that EPTP will be affected by the MMU changes, i.e. needs to be flushed.
> 
> I don't see when we would need "to determine which contexts" since the
> vCPU is using only one context in this case (which is the assumption in
> my mind); please correct me if I'm wrong.

As it exists today, I believe you're correct that KVM will only ever have a
single reachable TDP root, but only because of overzealous kvm_mmu_reset_context()
usage.  The SMM case in particular could be optimized to not zap all roots (whether
or not it's worth optimizing is another question).

All that said, the easiest way to query the number of reachable roots would be to
check the previous/cached root.

But, even if we can guarantee there's exactly one reachable root, I would be
surprised if doing INVEPT.context instead of INVEPT.global actually provided any
meaningful performance benefit.  Using INVEPT.context is safe if and only if there
are no other TLB entries for this vCPU, and KVM must invalidate on pCPU migration,
so there can't be collateral damage in that sense.

That leaves the latency of INVEPT as the only possible performance delta, and that
will be uarch specific.  It's entirely possible INVEPT.global is slower, but again
I would be surprised if it is so much slower than INVEPT.context that it actually
impacts guest performance given that its use is limited to slow paths.
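
For reference, the two INVEPT flavors under discussion, abridged from
arch/x86/kvm/vmx/ops.h (the VMX-fault error handling around the asm and
the unused second descriptor argument are elided):

static inline void __invept(unsigned long ext, u64 eptp)
{
	struct {
		u64 eptp, rsvd;	/* 128-bit INVEPT descriptor */
	} operand = { eptp, 0 };

	asm volatile("invept %0, %1"
		     : : "m"(operand), "r"(ext) : "cc", "memory");
}

static inline void ept_sync_global(void)
{
	__invept(VMX_EPT_EXTENT_GLOBAL, 0);
}

static inline void ept_sync_context(u64 eptp)
{
	if (cpu_has_vmx_invept_context())
		__invept(VMX_EPT_EXTENT_CONTEXT, eptp);
	else
		ept_sync_global();
}

Either flavor is a single instruction; the latency delta between the two
extents is, as noted above, entirely uarch-specific.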

Patch

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index be93d597306c..d6d67b816ebe 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -518,7 +518,33 @@ static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
 
 static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
 {
-	__vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid, invalidate_gpa);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/*
+	 * Flush all EPTP/VPID contexts if the TLB flush _may_ have been
+	 * invoked via kvm_flush_remote_tlbs(), which always passes %true for
+	 * @invalidate_gpa.  Flushing remote TLBs requires all contexts to be
+	 * flushed, not just the active context.
+	 *
+	 * Note, this also ensures a deferred TLB flush with VPID enabled and
+	 * EPT disabled invalidates the "correct" VPID, by nuking both L1 and
+	 * L2's VPIDs.
+	 */
+	if (invalidate_gpa) {
+		if (enable_ept) {
+			ept_sync_global();
+		} else if (enable_vpid) {
+			if (cpu_has_vmx_invvpid_global()) {
+				vpid_sync_vcpu_global();
+			} else {
+				WARN_ON_ONCE(!cpu_has_vmx_invvpid_single());
+				vpid_sync_vcpu_single(vmx->vpid);
+				vpid_sync_vcpu_single(vmx->nested.vpid02);
+			}
+		}
+	} else {
+		__vmx_flush_tlb(vcpu, vmx->vpid, false);
+	}
 }
 
 static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)