
[v4,12/12] KVM: x86/mmu: convert kvm_zap_gfn_range() to use shared mmu_lock in TDP MMU

Message ID 20230714065631.20869-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: refine memtype related mmu zap

Commit Message

Yan Zhao July 14, 2023, 6:56 a.m. UTC
Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding it for
read in the TDP MMU, and allow zapping of non-leaf SPTEs of level <= 1G.
TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic(), guarded
by the RCU lock.

A GFN zap can be super slow if mmu_lock is held for write and there is
contention. In the worst case, huge numbers of CPU cycles are spent yielding
GFN by GFN, i.e. the loop of "check and flush TLB -> drop RCU lock ->
drop mmu_lock -> cpu_relax() -> take mmu_lock -> take RCU lock" is entered
for every GFN.
Contention can come either from concurrent zaps holding mmu_lock for write or
from tdp_mmu_map() holding mmu_lock for read.
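
For reference, the yield path described above is roughly the following, a
simplified paraphrase of tdp_mmu_iter_cond_resched() (the _sketch name is
illustrative and the forward-progress bookkeeping is omitted; details vary
by kernel version):

static bool tdp_mmu_iter_cond_resched_sketch(struct kvm *kvm,
					     struct tdp_iter *iter,
					     bool flush, bool shared)
{
	/* No pending reschedule and no lock waiters: keep walking. */
	if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
		return false;

	/* Flush before dropping the locks so zapped SPTEs can't be reused. */
	if (flush)
		kvm_flush_remote_tlbs(kvm);

	rcu_read_unlock();

	/* Under write, any waiting reader (e.g. tdp_mmu_map()) triggers this. */
	if (shared)
		cond_resched_rwlock_read(&kvm->mmu_lock);
	else
		cond_resched_rwlock_write(&kvm->mmu_lock);

	rcu_read_lock();

	/* The caller restarts the current iteration, i.e. once per contended GFN. */
	return true;
}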

After converting to hold mmu_lock for read, less contention is detected and
retaking mmu_lock for read is also faster. There is no need to flush TLBs
before dropping mmu_lock on contention, as SPTEs have been zapped atomically
and TLB flushes are performed/requested immediately under the RCU lock.
To reduce the TLB flush count, non-leaf SPTEs of level not greater than 1G
are allowed to be zapped if their ranges are fully covered by the gfn zap
range.
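
Concretely, a non-leaf SPTE mapping KVM_PAGES_PER_HPAGE(level) GFNs is only
zapped wholesale when that whole range lies inside [gfn_start, gfn_end). A
minimal sketch of the eligibility test (the helper name is illustrative; the
logic mirrors the hunk below):

/* Illustrative helper: may this (possibly non-leaf) SPTE be zapped whole? */
static bool spte_fully_covered_by_zap(struct tdp_iter *iter,
				      gfn_t start, gfn_t end)
{
	/* e.g. a 2M SPTE covers 512 GFNs, a 1G SPTE covers 262144 GFNs. */
	return iter->gfn >= start &&
	       iter->gfn + KVM_PAGES_PER_HPAGE(iter->level) <= end &&
	       iter->level <= KVM_MAX_HUGEPAGE_LEVEL;
}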

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     | 14 +++++++----
 arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  1 +
 3 files changed, 60 insertions(+), 5 deletions(-)

Comments

Sean Christopherson Aug. 25, 2023, 9:34 p.m. UTC | #1
On Fri, Jul 14, 2023, Yan Zhao wrote:
> Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> by RCU lock.
> 
> GFN zap can be super slow if mmu_lock is held for write when there are
> contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> for every GFN.
> Contentions can either from concurrent zaps holding mmu_lock for write or
> from tdp_mmu_map() holding mmu_lock for read.

The lock contention should go away with a pre-check[*], correct?  That's a more
complete solution too, in that it also avoids lock contention for the shadow MMU,
which presumably suffers the same problem (I don't see anything that would prevent
it from yielding).

If we do want to zap with mmu_lock held for read, I think we should convert
kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
tasks will race to install SPTEs that are supposed to be zapped.
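
The interlock being referred to works roughly as follows: kvm_zap_gfn_range()
runs between kvm_mmu_invalidate_begin() and kvm_mmu_invalidate_end() (see the
hunk below), and the vCPU fault path snapshots mmu_invalidate_seq before
resolving the pfn, then rechecks under mmu_lock before installing an SPTE. A
simplified sketch of that recheck (the helper name is hypothetical; see
mmu_invalidate_retry_hva() and is_page_fault_stale() for the real code):

/* Hypothetical sketch; call with mmu_lock held before installing an SPTE. */
static bool fault_raced_with_invalidation(struct kvm *kvm,
					  unsigned long mmu_seq,
					  unsigned long hva)
{
	/* An invalidation covering this hva is in flight: retry the fault. */
	if (kvm->mmu_invalidate_in_progress &&
	    hva >= kvm->mmu_invalidate_range_start &&
	    hva < kvm->mmu_invalidate_range_end)
		return true;

	/* An invalidation completed after the pfn was resolved: retry too. */
	return kvm->mmu_invalidate_seq != mmu_seq;
}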

If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
post it as a standalone patch.  At a glance it doesn't have any dependencies on the
MTRR changes, and I don't want this type of change buried at the end of a series
that is for a fairly niche setup.  This needs a lot of scrutiny to make sure zapping
under read really is safe.

[*] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

> After converting to hold mmu_lock for read, there will be less contentions
> detected and retaking mmu_lock for read is also faster. There's no need to
> flush TLB before dropping mmu_lock when there're contentions as SPTEs have
> been zapped atomically and TLBs are flushed/flush requested immediately
> within RCU lock.
> In order to reduce TLB flush count, non-leaf SPTEs not greater than 1G
> level are allowed to be zapped if their ranges are fully covered in the
> gfn zap range.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
Yan Zhao Sept. 4, 2023, 7:31 a.m. UTC | #2
On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Yan Zhao wrote:
> > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> > read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> > by RCU lock.
> > 
> > GFN zap can be super slow if mmu_lock is held for write when there are
> > contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> > GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> > for every GFN.
> > Contentions can either from concurrent zaps holding mmu_lock for write or
> > from tdp_mmu_map() holding mmu_lock for read.
> 
> The lock contention should go away with a pre-check[*], correct?  That's a more
Yes, I think so, though I don't have time to verify it yet.

> complete solution too, in that it also avoids lock contention for the shadow MMU,
> which presumably suffers the same problem (I don't see anything that would prevent
> it from yielding).
> 
> If we do want to zap with mmu_lock held for read, I think we should convert
> kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> tasks will race to install SPTEs that are supposed to be zapped.
Yes. I didn't do that for the unmap path only because I didn't want to make a
big code change.
The write lock in the kvm_unmap_gfn_range() path is taken in arch-agnostic
code, which is not easy to change, right?

> 
> If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> post it as a standalone patch.  At a glance it doesn't have any dependencies on the
> MTRR changes, and I don't want this type of changed buried at the end of a series
> that is for a fairly niche setup.  This needs a lot of scrutiny to make sure zapping
> under read really is safe
Given that the pre-check patch should work, do you think it's still worthwhile
to do this conversion?

> 
> [*] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com
> 
> > After converting to hold mmu_lock for read, there will be less contentions
> > detected and retaking mmu_lock for read is also faster. There's no need to
> > flush TLB before dropping mmu_lock when there're contentions as SPTEs have
> > been zapped atomically and TLBs are flushed/flush requested immediately
> > within RCU lock.
> > In order to reduce TLB flush count, non-leaf SPTEs not greater than 1G
> > level are allowed to be zapped if their ranges are fully covered in the
> > gfn zap range.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
Sean Christopherson Sept. 5, 2023, 10:31 p.m. UTC | #3
On Mon, Sep 04, 2023, Yan Zhao wrote:
> On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> > On Fri, Jul 14, 2023, Yan Zhao wrote:
> > > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> > > read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> > > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> > > by RCU lock.
> > > 
> > > GFN zap can be super slow if mmu_lock is held for write when there are
> > > contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> > > GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> > > for every GFN.
> > > Contentions can either from concurrent zaps holding mmu_lock for write or
> > > from tdp_mmu_map() holding mmu_lock for read.
> > 
> > The lock contention should go away with a pre-check[*], correct?  That's a more
> Yes, I think so, though I don't have time to verify it yet.
> 
> > complete solution too, in that it also avoids lock contention for the shadow MMU,
> > which presumably suffers the same problem (I don't see anything that would prevent
> > it from yielding).
> > 
> > If we do want to zap with mmu_lock held for read, I think we should convert
> > kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> > missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> > the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> > tasks will race to install SPTEs that are supposed to be zapped.
> Yes. I did't do that to the unmap path was only because I don't want to make a
> big code change.
> The write lock in kvm_unmap_gfn_range() path is taken in arch-agnostic code,
> which is not easy to change, right?

Yeah.  The lock itself isn't bad, especially if we can convert all mmu_notifier
hooks, e.g. we already have KVM_MMU_LOCK(), adding a variant for mmu_notifiers
would be quite easy.

The bigger problem would be kvm_mmu_invalidate_{begin,end}() and getting the
memory ordering right, especially if there are multiple mmu_notifier events in
flight.
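
For the first point, a minimal sketch of what such a variant could look like,
modeled on the existing KVM_MMU_LOCK()/KVM_MMU_UNLOCK() wrappers (the
KVM_MMU_NOTIFIER_* names are hypothetical):

/*
 * Hypothetical mmu_notifier-specific wrappers: take mmu_lock for read on
 * architectures that use an rwlock, fall back to the spinlock elsewhere.
 */
#ifdef KVM_HAVE_MMU_RWLOCK
#define KVM_MMU_NOTIFIER_LOCK(kvm)	read_lock(&(kvm)->mmu_lock)
#define KVM_MMU_NOTIFIER_UNLOCK(kvm)	read_unlock(&(kvm)->mmu_lock)
#else
#define KVM_MMU_NOTIFIER_LOCK(kvm)	spin_lock(&(kvm)->mmu_lock)
#define KVM_MMU_NOTIFIER_UNLOCK(kvm)	spin_unlock(&(kvm)->mmu_lock)
#endif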

But I was actually thinking of a cheesier approach: drop and reacquire mmu_lock
when zapping, e.g., leaving aside the changes that would also be needed inside tdp_mmu_zap_leafs():

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 735c976913c2..c89a2511789b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -882,9 +882,15 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
 {
        struct kvm_mmu_page *root;
 
+       write_unlock(&kvm->mmu_lock);
+       read_lock(&kvm->mmu_lock);
+
        for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
                flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
 
+       read_unlock(&kvm->mmu_lock);
+       write_lock(&kvm->mmu_lock);
+
        return flush;
 }

vCPUs would still get blocked, but for a smaller duration, and the lock contention
between vCPUs and the zapping task would mostly go away.
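
For reference, the changes inside tdp_mmu_zap_leafs() that a proper read-lock
conversion would need look much like zap_gfn_range_atomic() in the patch: zap
via tdp_mmu_zap_spte_atomic() (which handles the remote TLB flush on success),
pass shared=true when yielding, and assert mmu_lock is held for read. A rough,
untested sketch (the function name is illustrative):

/* Illustrative leaf-only zap that is safe under mmu_lock held for read. */
static void tdp_mmu_zap_leafs_shared(struct kvm *kvm, struct kvm_mmu_page *root,
				     gfn_t start, gfn_t end)
{
	struct tdp_iter iter;

	end = min(end, tdp_mmu_max_gfn_exclusive());

	lockdep_assert_held_read(&kvm->mmu_lock);

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
retry:
		/* shared=true: yield via cond_resched_rwlock_read(). */
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_last_spte(iter.old_spte, iter.level))
			continue;

		/* A failed atomic zap means another task changed the SPTE; retry. */
		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
			goto retry;
	}

	rcu_read_unlock();
}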

> > If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> > post it as a standalone patch.  At a glance it doesn't have any dependencies on the
> > MTRR changes, and I don't want this type of changed buried at the end of a series
> > that is for a fairly niche setup.  This needs a lot of scrutiny to make sure zapping
> > under read really is safe
> Given the pre-check patch should work, do you think it's still worthwhile to do
> this convertion?

I do think it would be a net positive, though I don't know that it's worth your
time without a concrete use cases.  My gut instinct could be wrong, so I wouldn't
want to take on the risk of running with mmu_lock held for read without hard
performance numbers to justify the change.
Yan Zhao Sept. 6, 2023, 12:50 a.m. UTC | #4
On Tue, Sep 05, 2023 at 03:31:59PM -0700, Sean Christopherson wrote:
> On Mon, Sep 04, 2023, Yan Zhao wrote:
> > On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> > > On Fri, Jul 14, 2023, Yan Zhao wrote:
> > > > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> > > > read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> > > > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> > > > by RCU lock.
> > > > 
> > > > GFN zap can be super slow if mmu_lock is held for write when there are
> > > > contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> > > > GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > > > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> > > > for every GFN.
> > > > Contentions can either from concurrent zaps holding mmu_lock for write or
> > > > from tdp_mmu_map() holding mmu_lock for read.
> > > 
> > > The lock contention should go away with a pre-check[*], correct?  That's a more
> > Yes, I think so, though I don't have time to verify it yet.
> > 
> > > complete solution too, in that it also avoids lock contention for the shadow MMU,
> > > which presumably suffers the same problem (I don't see anything that would prevent
> > > it from yielding).
> > > 
> > > If we do want to zap with mmu_lock held for read, I think we should convert
> > > kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> > > missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> > > the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> > > tasks will race to install SPTEs that are supposed to be zapped.
> > Yes. I did't do that to the unmap path was only because I don't want to make a
> > big code change.
> > The write lock in kvm_unmap_gfn_range() path is taken in arch-agnostic code,
> > which is not easy to change, right?
> 
> Yeah.  The lock itself isn't bad, especially if we can convert all mmu_nofitier
> hooks, e.g. we already have KVM_MMU_LOCK(), adding a variant for mmu_notifiers
> would be quite easy.
>
> The bigger problem would be kvm_mmu_invalidate_{begin,end}() and getting the
> memory ordering right, especially if there are multiple mmu_notifier events in
> flight.
> 
> But I was actually thinking of a cheesier approach: drop and reacquire mmu_lock
> when zapping, e.g. without the necessary changes in tdp_mmu_zap_leafs():
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 735c976913c2..c89a2511789b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -882,9 +882,15 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
>  {
>         struct kvm_mmu_page *root;
>  
> +       write_unlock(&kvm->mmu_lock);
> +       read_lock(&kvm->mmu_lock);
> +
>         for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
>                 flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
>  
> +       read_unlock(&kvm->mmu_lock);
> +       write_lock(&kvm->mmu_lock);
> +
>         return flush;
>  }
> 
> vCPUs would still get blocked, but for a smaller duration, and the lock contention
> between vCPUs and the zapping task would mostly go away.
>
Yes, I actually did a similar thing locally, i.e. releasing the write lock and
taking the read lock before zapping.
But yes, I also think it's cheesier, as the caller that took the write lock
knows nothing about its write lock being replaced with a read lock.


> > > If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> > > post it as a standalone patch.  At a glance it doesn't have any dependencies on the
> > > MTRR changes, and I don't want this type of changed buried at the end of a series
> > > that is for a fairly niche setup.  This needs a lot of scrutiny to make sure zapping
> > > under read really is safe
> > Given the pre-check patch should work, do you think it's still worthwhile to do
> > this convertion?
> 
> I do think it would be a net positive, though I don't know that it's worth your
> time without a concrete use cases.  My gut instinct could be wrong, so I wouldn't
> want to take on the risk of running with mmu_lock held for read without hard
> performance numbers to justify the change.
OK, I see. I may try the conversion later if I find a performance justification.

Thanks!

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7f52bbe013b3..1fa2a0a3fc9b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6310,15 +6310,19 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
+	if (flush)
+		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
+
 	if (tdp_mmu_enabled) {
+		write_unlock(&kvm->mmu_lock);
+		read_lock(&kvm->mmu_lock);
+
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
-						      gfn_end, true, flush);
+			kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start, gfn_end);
+		read_unlock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
 	}
 
-	if (flush)
-		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
-
 	kvm_mmu_invalidate_end(kvm, 0, -1ul);
 
 	write_unlock(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..2ad18275b643 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -888,6 +888,56 @@  bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
 	return flush;
 }
 
+static void zap_gfn_range_atomic(struct kvm *kvm, struct kvm_mmu_page *root,
+				 gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+
+	end = min(end, tdp_mmu_max_gfn_exclusive());
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	rcu_read_lock();
+
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * As also documented in tdp_mmu_zap_root(),
+		 * KVM must be able to zap a 1gb shadow page without
+		 * inducing a stall to allow in-place replacement with a 1gb hugepage.
+		 */
+		if (iter.gfn < start ||
+		    iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
+		    iter.level > KVM_MAX_HUGEPAGE_LEVEL)
+			continue;
+
+		/* Note, a successful atomic zap also does a remote TLB flush. */
+		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
+			goto retry;
+	}
+
+	rcu_read_unlock();
+}
+
+/*
+ * Zap all SPTEs for the range of gfns, [start, end), for all roots, with
+ * mmu_lock held for read and the SPTEs zapped atomically.
+ * TLB flushes are performed within the RCU lock.
+ */
+void kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start, gfn_t end)
+{
+	struct kvm_mmu_page *root;
+
+	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id, true)
+		zap_gfn_range_atomic(kvm, root, start, end);
+}
+
 void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 {
 	struct kvm_mmu_page *root;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..90856bd7a2fd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,6 +22,7 @@  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
 				 gfn_t end, bool can_yield, bool flush);
+void kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start, gfn_t end);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);