
KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE

Message ID 20240307194059.1357377-1-dmatlack@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE

Commit Message

David Matlack March 7, 2024, 7:40 p.m. UTC
Process SPTEs zapped under the read-lock after the TLB flush and
replacement of REMOVED_SPTE with 0. This minimizes the contention on the
child SPTEs (if zapping an SPTE that points to a page table) and
minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.

In VMs with a large number of (400+) vCPUs, it can take KVM multiple
seconds to process a 1GiB region mapped with 4KiB entries, e.g. when
disabling dirty logging in a VM backed by 1GiB HugeTLB. During those
seconds, if a vCPU accesses the 1GiB region being zapped, it will be
stalled until KVM finishes processing the SPTE and replaces the
REMOVED_SPTE with 0.

Re-ordering the processing does speed up the atomic-zaps somewhat, but
the main benefit is avoiding blocking vCPU threads.
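
To make the ordering change concrete, below is a condensed sketch of
tdp_mmu_zap_spte_atomic() before and after this patch, drawn from the
diff at the end of this mail; the TLB flush and lockdep assertions are
elided and error paths simplified, so treat it as an illustration
rather than the literal code.

 /* Before: the old SPTE is processed while vCPUs still see REMOVED_SPTE. */
 static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, struct tdp_iter *iter)
 {
 	int ret;

 	/* cmpxchg in REMOVED_SPTE and immediately do all the bookkeeping. */
 	ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
 	if (ret)
 		return ret;

 	/* ... TLB flush of the zapped GFN (unchanged context, elided) ... */

 	/* Only now do vCPUs stop being blocked by REMOVED_SPTE. */
 	__kvm_tdp_mmu_write_spte(iter->sptep, 0);
 	return 0;
 }

 /* After: freeze, flush, unfreeze, then do the bookkeeping. */
 static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, struct tdp_iter *iter)
 {
 	int ret;

 	/* cmpxchg in REMOVED_SPTE without processing the old SPTE. */
 	ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
 	if (ret)
 		return ret;

 	/* ... TLB flush of the zapped GFN (unchanged context, elided) ... */

 	/* Unblock vCPUs before the (potentially slow) processing. */
 	__kvm_tdp_mmu_write_spte(iter->sptep, 0);

 	/* Process the zapped SPTE off the vCPU critical path. */
 	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			    0, iter->level, true);
 	return 0;
 }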

Before:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
 ...
 Disabling dirty logging time: 509.765146313s

 $ ./funclatency -m tdp_mmu_zap_spte_atomic

     msec                : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 8        |**                                      |
       256 -> 511        : 68       |******************                      |
       512 -> 1023       : 129      |**********************************      |
      1024 -> 2047       : 151      |****************************************|
      2048 -> 4095       : 60       |***************                         |

After:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
 ...
 Disabling dirty logging time: 336.516838548s

 $ ./funclatency -m tdp_mmu_zap_spte_atomic

     msec                : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 12       |**                                      |
       256 -> 511        : 166      |****************************************|
       512 -> 1023       : 101      |************************                |
      1024 -> 2047       : 137      |*********************************       |

KVM's processing of collapsible SPTEs is still extremely slow and can be
improved. For example, a significant amount of time is spent calling
kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
redundant when processing SPTEs that all map the same folio.
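
As a rough, hedged illustration of that redundancy (the conditions are
simplified; only the kvm_set_pfn_*() calls themselves come from the
existing code): handle_changed_spte() does per-SPTE accessed/dirty
bookkeeping along these lines, so zapping a 1GiB region mapped with
4KiB SPTEs makes on the order of 262,144 such calls even though every
pfn belongs to the same folio.

 	/* Simplified sketch of handle_changed_spte()'s per-SPTE bookkeeping. */
 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));

A single accessed/dirty update per folio per batch of zapped SPTEs
would convey the same information.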

Cc: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 26 deletions(-)


base-commit: 0c64952fec3ea01cb5b09f00134200f3e7ab40d5

Comments

Vipin Sharma March 7, 2024, 9:34 p.m. UTC | #1
On Thu, Mar 7, 2024 at 11:41 AM David Matlack <dmatlack@google.com> wrote:
>
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.
>
> In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to
> process a 1GiB region mapped with 4KiB entries, e.g. when disabling
> dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a
> vCPU accesses the 1GiB region being zapped it will be stalled until KVM
> finishes processing the SPTE and replaces the REMOVED_SPTE with 0.
>
> Re-ordering the processing does speed up the atomic-zaps somewhat, but
> the main benefit is avoiding blocking vCPU threads.
>
> Before:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 509.765146313s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>      msec                : count    distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 8        |**                                      |
>        256 -> 511        : 68       |******************                      |
>        512 -> 1023       : 129      |**********************************      |
>       1024 -> 2047       : 151      |****************************************|
>       2048 -> 4095       : 60       |***************                         |
>
> After:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 336.516838548s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>      msec                : count    distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 12       |**                                      |
>        256 -> 511        : 166      |****************************************|
>        512 -> 1023       : 101      |************************                |
>       1024 -> 2047       : 137      |*********************************       |

Nice! The whole 2048 -> 4095 bucket is gone.

>
> KVM's processing of collapsible SPTEs is still extremely slow and can be
> improved. For example, a significant amount of time is spent calling
> kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
> redundant when processing SPTEs that all map the folio.
>
> Cc: Vipin Sharma <vipinsh@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------
>  1 file changed, 55 insertions(+), 26 deletions(-)
>
Reviewed-by: Vipin Sharma <vipinsh@google.com>
Sean Christopherson April 9, 2024, 11:31 p.m. UTC | #2
On Thu, Mar 07, 2024, David Matlack wrote:
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.

I think it's worth explicitly calling out the impact of each part of the change,
i.e. that flushing TLBs avoids child SPTE contention, and waiting until the SPTE
is back to '0' avoids blocking vCPUs.  That's kinda sorta implied by the ordering,
but I'm guessing it won't be super obvious to folks that aren't already familiar
with the TDP MMU.

> In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to

large (400+) "number of" vCPUs

> process a 1GiB region mapped with 4KiB entries, e.g. when disabling
> dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a
> vCPU accesses the 1GiB region being zapped it will be stalled until KVM
> finishes processing the SPTE and replaces the REMOVED_SPTE with 0.

...

> KVM's processing of collapsible SPTEs is still extremely slow and can be
> improved. For example, a significant amount of time is spent calling
> kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
> redundant when processing SPTEs that all map the folio.

"same folio"

I also massaged this to make it clear that this is a "hey, by the way" sorta
thing, and that _this_ patch is valuable even if/when we go clean up the A/D
performance mess.

> +	 * This should be safe because KVM does not depend on any of the

Don't hedge.  If we're wrong and it's not safe, then there will be revert with
a changelog explaining exactly why we're wrong.  But in the (hopefully) likely
case that we're correct, hedging only confuses future readers, e.g. unnecessarily
makes them wonder if maaaaybe this code isn't actually safe.

> +	 * processing completing before a new SPTE is installed to map a given
> +	 * GFN. Case in point, kvm_mmu_zap_all_fast() can result in KVM
> +	 * processing all SPTEs in a given root after vCPUs create mappings in
> +	 * a new root.

This belongs in the changelog, as it references a very specific point in time.
It's extremely unlikely, but if kvm_mmu_zap_all_fast()'s behavior changed then
we'd end up with a stale comment that doesn't actually provide much value to the
code it's commenting.

I'll fix up all of the above when applying (it's basically just me obsessing over
the changelog).
Sean Christopherson April 10, 2024, 12:19 a.m. UTC | #3
On Thu, 07 Mar 2024 11:40:59 -0800, David Matlack wrote:
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.
> 
> In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to
> process a 1GiB region mapped with 4KiB entries, e.g. when disabling
> dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a
> vCPU accesses the 1GiB region being zapped it will be stalled until KVM
> finishes processing the SPTE and replaces the REMOVED_SPTE with 0.
> 
> [...]

Applied to kvm-x86 mmu, with the tweaks mentioned earlier.  Thanks!

[1/1] KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE
      https://github.com/kvm-x86/linux/commit/aca48556c592

--
https://github.com/kvm-x86/linux/tree/next

Patch

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..e169e7ee6c40 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -530,6 +530,31 @@  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
+static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+{
+	u64 *sptep = rcu_dereference(iter->sptep);
+
+	/*
+	 * The caller is responsible for ensuring the old SPTE is not a REMOVED
+	 * SPTE.  KVM should never attempt to zap or manipulate a REMOVED SPTE,
+	 * and pre-checking before inserting a new SPTE is advantageous as it
+	 * avoids unnecessary work.
+	 */
+	WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+
+	/*
+	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
+	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
+	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
+	 * the current value, so the caller operates on fresh data, e.g. if it
+	 * retries tdp_mmu_set_spte_atomic()
+	 */
+	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+		return -EBUSY;
+
+	return 0;
+}
+
 /*
  * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
  * and handle the associated bookkeeping.  Do not mark the page dirty
@@ -551,27 +576,13 @@  static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					  struct tdp_iter *iter,
 					  u64 new_spte)
 {
-	u64 *sptep = rcu_dereference(iter->sptep);
-
-	/*
-	 * The caller is responsible for ensuring the old SPTE is not a REMOVED
-	 * SPTE.  KVM should never attempt to zap or manipulate a REMOVED SPTE,
-	 * and pre-checking before inserting a new SPTE is advantageous as it
-	 * avoids unnecessary work.
-	 */
-	WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+	int ret;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	/*
-	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
-	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
-	 * the current value, so the caller operates on fresh data, e.g. if it
-	 * retries tdp_mmu_set_spte_atomic()
-	 */
-	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
-		return -EBUSY;
+	ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+	if (ret)
+		return ret;
 
 	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			    new_spte, iter->level, true);
@@ -584,13 +595,17 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 {
 	int ret;
 
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
 	/*
-	 * Freeze the SPTE by setting it to a special,
-	 * non-present value. This will stop other threads from
-	 * immediately installing a present entry in its place
-	 * before the TLBs are flushed.
+	 * Freeze the SPTE by setting it to a special, non-present value. This
+	 * will stop other threads from immediately installing a present entry
+	 * in its place before the TLBs are flushed.
+	 *
+	 * Delay processing of the zapped SPTE until after TLBs are flushed and
+	 * the REMOVED_SPTE is replaced (see below).
 	 */
-	ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
+	ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
 	if (ret)
 		return ret;
 
@@ -599,12 +614,26 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	/*
 	 * No other thread can overwrite the removed SPTE as they must either
 	 * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
-	 * overwrite the special removed SPTE value. No bookkeeping is needed
-	 * here since the SPTE is going from non-present to non-present.  Use
-	 * the raw write helper to avoid an unnecessary check on volatile bits.
+	 * overwrite the special removed SPTE value. Use the raw write helper to
+	 * avoid an unnecessary check on volatile bits.
 	 */
 	__kvm_tdp_mmu_write_spte(iter->sptep, 0);
 
+	/*
+	 * Process the zapped SPTE after flushing TLBs and replacing
+	 * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
+	 * blocked by the REMOVED_SPTE and reduces contention on the child
+	 * SPTEs.
+	 *
+	 * This should be safe because KVM does not depend on any of the
+	 * processing completing before a new SPTE is installed to map a given
+	 * GFN. Case in point, kvm_mmu_zap_all_fast() can result in KVM
+	 * processing all SPTEs in a given root after vCPUs create mappings in
+	 * a new root.
+	 */
+	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+			    0, iter->level, true);
+
 	return 0;
 }
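
For context (not part of this patch), a hedged sketch of how a caller
consumes the -EBUSY contract described in the comment above; the
iteration macro is a hypothetical stand-in for KVM's real TDP iterator
helpers, so only the retry shape is meant to be taken literally:

 	/*
 	 * Illustrative only: on -EBUSY another CPU changed the SPTE first;
 	 * try_cmpxchg64() has already refreshed iter.old_spte, so the caller
 	 * can simply re-evaluate and retry the same SPTE.
 	 */
 	for_each_leaf_spte(&iter, root, start, end) {	/* hypothetical macro */
 retry:
 		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
 	}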