Message ID | 20240307194059.1357377-1-dmatlack@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE | expand |
On Thu, Mar 7, 2024 at 11:41 AM David Matlack <dmatlack@google.com> wrote: > > Process SPTEs zapped under the read-lock after the TLB flush and > replacement of REMOVED_SPTE with 0. This minimizes the contention on the > child SPTEs (if zapping an SPTE that points to a page table) and > minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. > > In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to > process a 1GiB region mapped with 4KiB entries, e.g. when disabling > dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a > vCPU accesses the 1GiB region being zapped it will be stalled until KVM > finishes processing the SPTE and replaces the REMOVED_SPTE with 0. > > Re-ordering the processing does speed up the atomic-zaps somewhat, but > the main benefit is avoiding blocking vCPU threads. > > Before: > > $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e > ... > Disabling dirty logging time: 509.765146313s > > $ ./funclatency -m tdp_mmu_zap_spte_atomic > > msec : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 8 |** | > 256 -> 511 : 68 |****************** | > 512 -> 1023 : 129 |********************************** | > 1024 -> 2047 : 151 |****************************************| > 2048 -> 4095 : 60 |*************** | > > After: > > $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e > ... > Disabling dirty logging time: 336.516838548s > > $ ./funclatency -m tdp_mmu_zap_spte_atomic > > msec : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 12 |** | > 256 -> 511 : 166 |****************************************| > 512 -> 1023 : 101 |************************ | > 1024 -> 2047 : 137 |********************************* | Nice! Whole 2048-> 4095 is gone. > > KVM's processing of collapsible SPTEs is still extremely slow and can be > improved. For example, a significant amount of time is spent calling > kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is > redundant when processing SPTEs that all map the folio. > > Cc: Vipin Sharma <vipinsh@google.com> > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: David Matlack <dmatlack@google.com> > --- > arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------ > 1 file changed, 55 insertions(+), 26 deletions(-) > Reviewed-by: Vipin Sharma <vipinsh@google.com>
On Thu, Mar 07, 2024, David Matlack wrote: > Process SPTEs zapped under the read-lock after the TLB flush and > replacement of REMOVED_SPTE with 0. This minimizes the contention on the > child SPTEs (if zapping an SPTE that points to a page table) and > minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. I think it's worth explicitly calling out the impact of each part of the change, i.e. that flushing TLBs avoids child SPTE contention, and waiting until the SPTE is back to '0' avoids blocking vCPUs. That's kinda sorta implied by the ordering, but I'm guessing it won't be super obvious to folks that aren't already familiar with the TDP MMU. > In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to large (400+) "number of" vCPUs > process a 1GiB region mapped with 4KiB entries, e.g. when disabling > dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a > vCPU accesses the 1GiB region being zapped it will be stalled until KVM > finishes processing the SPTE and replaces the REMOVED_SPTE with 0. ... > KVM's processing of collapsible SPTEs is still extremely slow and can be > improved. For example, a significant amount of time is spent calling > kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is > redundant when processing SPTEs that all map the folio. "same folio" I also massaged this to make it clear that this is a "hey, by the way" sorta thing, and that _this_ patch is valuable even if/when we go clean up the A/D performance mess. > + * This should be safe because KVM does not depend on any of the Don't hedge. If we're wrong and it's not safe, then there will be revert with a changelog explaining exactly why we're wrong. But in the (hopefully) likely case that we're correct, hedging only confuses future readers, e.g. unnecessarily makes them wonder if maaaaybe this code isn't actually safe. > + * processing completing before a new SPTE is installed to map a given > + * GFN. Case in point, kvm_mmu_zap_all_fast() can result in KVM > + * processing all SPTEs in a given root after vCPUs create mappings in > + * a new root. This belongs in the changelog, as it references a very specific point in time. It's extremely unlikely, but if kvm_mmu_zap_all_fast()'s behavior changed then we'd end up with a stale comment that doesn't actually provide much value to the code it's commenting. I'll fix up all of the above when applying (it's basically just me obsessing over the changelog).
On Thu, 07 Mar 2024 11:40:59 -0800, David Matlack wrote: > Process SPTEs zapped under the read-lock after the TLB flush and > replacement of REMOVED_SPTE with 0. This minimizes the contention on the > child SPTEs (if zapping an SPTE that points to a page table) and > minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. > > In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to > process a 1GiB region mapped with 4KiB entries, e.g. when disabling > dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a > vCPU accesses the 1GiB region being zapped it will be stalled until KVM > finishes processing the SPTE and replaces the REMOVED_SPTE with 0. > > [...] Applied to kvm-x86 mmu, with the tweaks mentioned earlier. Thanks! [1/1] KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE https://github.com/kvm-x86/linux/commit/aca48556c592 -- https://github.com/kvm-x86/linux/tree/next
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index d078157e62aa..e169e7ee6c40 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -530,6 +530,31 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, kvm_set_pfn_accessed(spte_to_pfn(old_spte)); } +static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte) +{ + u64 *sptep = rcu_dereference(iter->sptep); + + /* + * The caller is responsible for ensuring the old SPTE is not a REMOVED + * SPTE. KVM should never attempt to zap or manipulate a REMOVED SPTE, + * and pre-checking before inserting a new SPTE is advantageous as it + * avoids unnecessary work. + */ + WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte)); + + /* + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and + * does not hold the mmu_lock. On failure, i.e. if a different logical + * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with + * the current value, so the caller operates on fresh data, e.g. if it + * retries tdp_mmu_set_spte_atomic() + */ + if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) + return -EBUSY; + + return 0; +} + /* * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically * and handle the associated bookkeeping. Do not mark the page dirty @@ -551,27 +576,13 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte) { - u64 *sptep = rcu_dereference(iter->sptep); - - /* - * The caller is responsible for ensuring the old SPTE is not a REMOVED - * SPTE. KVM should never attempt to zap or manipulate a REMOVED SPTE, - * and pre-checking before inserting a new SPTE is advantageous as it - * avoids unnecessary work. - */ - WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte)); + int ret; lockdep_assert_held_read(&kvm->mmu_lock); - /* - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and - * does not hold the mmu_lock. On failure, i.e. if a different logical - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with - * the current value, so the caller operates on fresh data, e.g. if it - * retries tdp_mmu_set_spte_atomic() - */ - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) - return -EBUSY; + ret = __tdp_mmu_set_spte_atomic(iter, new_spte); + if (ret) + return ret; handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, new_spte, iter->level, true); @@ -584,13 +595,17 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, { int ret; + lockdep_assert_held_read(&kvm->mmu_lock); + /* - * Freeze the SPTE by setting it to a special, - * non-present value. This will stop other threads from - * immediately installing a present entry in its place - * before the TLBs are flushed. + * Freeze the SPTE by setting it to a special, non-present value. This + * will stop other threads from immediately installing a present entry + * in its place before the TLBs are flushed. + * + * Delay processing of the zapped SPTE until after TLBs are flushed and + * the REMOVED_SPTE is replaced (see below). */ - ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE); + ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE); if (ret) return ret; @@ -599,12 +614,26 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, /* * No other thread can overwrite the removed SPTE as they must either * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not - * overwrite the special removed SPTE value. No bookkeeping is needed - * here since the SPTE is going from non-present to non-present. Use - * the raw write helper to avoid an unnecessary check on volatile bits. + * overwrite the special removed SPTE value. Use the raw write helper to + * avoid an unnecessary check on volatile bits. */ __kvm_tdp_mmu_write_spte(iter->sptep, 0); + /* + * Process the zapped SPTE after flushing TLBs and replacing + * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are + * blocked by the REMOVED_SPTE and reduces contention on the child + * SPTEs. + * + * This should be safe because KVM does not depend on any of the + * processing completing before a new SPTE is installed to map a given + * GFN. Case in point, kvm_mmu_zap_all_fast() can result in KVM + * processing all SPTEs in a given root after vCPUs create mappings in + * a new root. + */ + handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, + 0, iter->level, true); + return 0; }
Process SPTEs zapped under the read-lock after the TLB flush and replacement of REMOVED_SPTE with 0. This minimizes the contention on the child SPTEs (if zapping an SPTE that points to a page table) and minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to process a 1GiB region mapped with 4KiB entries, e.g. when disabling dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a vCPU accesses the 1GiB region being zapped it will be stalled until KVM finishes processing the SPTE and replaces the REMOVED_SPTE with 0. Re-ordering the processing does speed up the atomic-zaps somewhat, but the main benefit is avoiding blocking vCPU threads. Before: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 509.765146313s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 8 |** | 256 -> 511 : 68 |****************** | 512 -> 1023 : 129 |********************************** | 1024 -> 2047 : 151 |****************************************| 2048 -> 4095 : 60 |*************** | After: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 336.516838548s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 12 |** | 256 -> 511 : 166 |****************************************| 512 -> 1023 : 101 |************************ | 1024 -> 2047 : 137 |********************************* | KVM's processing of collapsible SPTEs is still extremely slow and can be improved. For example, a significant amount of time is spent calling kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is redundant when processing SPTEs that all map the folio. Cc: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> --- arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------ 1 file changed, 55 insertions(+), 26 deletions(-) base-commit: 0c64952fec3ea01cb5b09f00134200f3e7ab40d5