
[RFC,00/19] KVM: x86/mmu: Optimize disabling dirty logging

Message ID 20211110223010.1392399-1-bgardon@google.com

Message

Ben Gardon Nov. 10, 2021, 10:29 p.m. UTC
Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging
with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This
series optimizes TLB flushes and introduces in-place large page
promotion, to bring the disable dirty log time down to ~2 seconds.
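To give a rough idea of what "in-place promotion" means here, below is a
toy, self-contained C model, not the actual KVM code; the types and helper
names are invented purely for illustration. The point it tries to capture
is that when dirty logging is turned off, a non-leaf entry whose range is
backed by one contiguous huge page can be rewritten in place as a single
huge leaf mapping, instead of being zapped and re-faulted in 4K at a time:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_2M_REGION 512		/* 2M / 4K */

/* Toy "SPTE": either points at a lower-level table or maps a huge leaf. */
struct toy_spte {
	bool     is_huge_leaf;
	uint64_t base_pfn;
};

/* Stand-in for "is this GFN range backed by one contiguous huge page?" */
static bool backing_is_contiguous_huge(uint64_t gfn)
{
	return (gfn % PAGES_PER_2M_REGION) == 0;	/* aligned in this model */
}

/*
 * In-place promotion: rather than zapping the parent entry and paying a
 * fault per 4K page afterwards, install one huge leaf mapping directly.
 */
static bool try_promote_in_place(struct toy_spte *parent, uint64_t gfn)
{
	if (parent->is_huge_leaf || !backing_is_contiguous_huge(gfn))
		return false;

	parent->is_huge_leaf = true;
	parent->base_pfn = gfn;		/* identity map in this toy model */
	return true;
}

int main(void)
{
	struct toy_spte spte = { 0 };

	if (try_promote_in_place(&spte, 512))
		printf("GFN 512 promoted to a huge mapping in place\n");
	return 0;
}

The real patches of course have to validate more than alignment before
installing such a leaf, e.g. the memory type, which is why the series also
refactors vmx_get_mt_mask.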

Testing:
Ran KVM selftests and kvm-unit-tests on an Intel Skylake. This
series introduced no new failures.

Performance:
To collect these results I needed to apply Mingwei's patch
"selftests: KVM: align guest physical memory base address to 1GB"
https://lkml.org/lkml/2021/8/29/310
David Matlack is going to send out an updated version of that patch soon.

Without this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 10.966500447s
Enabling dirty logging time: 0.002068737s

Iteration 1 dirty memory time: 0.047556280s
Iteration 1 get dirty log time: 0.001253914s
Iteration 1 clear dirty log time: 0.049716661s
Iteration 2 dirty memory time: 3.679662016s
Iteration 2 get dirty log time: 0.000659546s
Iteration 2 clear dirty log time: 1.834329322s
Disabling dirty logging time: 45.738439510s
Get dirty log over 2 iterations took 0.001913460s. (Avg 0.000956730s/iteration)
Clear dirty log over 2 iterations took 1.884045983s. (Avg 0.942022991s/iteration)

Without this series, Legacy MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.664750666s
Enabling dirty logging time: 0.002025510s

Iteration 1 dirty memory time: 0.046240875s
Iteration 1 get dirty log time: 0.001864342s
Iteration 1 clear dirty log time: 0.170243637s
Iteration 2 dirty memory time: 31.571088701s
Iteration 2 get dirty log time: 0.000626245s
Iteration 2 clear dirty log time: 1.294817729s
Disabling dirty logging time: 3.566831573s
Get dirty log over 2 iterations took 0.002490587s. (Avg 0.001245293s/iteration)
Clear dirty log over 2 iterations took 1.465061366s. (Avg 0.732530683s/iteration)

With this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.016653537s
Enabling dirty logging time: 0.001992860s

Iteration 1 dirty memory time: 0.046701599s
Iteration 1 get dirty log time: 0.001214806s
Iteration 1 clear dirty log time: 0.049519923s
Iteration 2 dirty memory time: 3.581931268s
Iteration 2 get dirty log time: 0.000621383s
Iteration 2 clear dirty log time: 1.894597059s
Disabling dirty logging time: 1.950542092s
Get dirty log over 2 iterations took 0.001836189s. (Avg 0.000918094s/iteration)
Clear dirty log over 2 iterations took 1.944116982s. (Avg 0.972058491s/iteration)

Patch breakdown:
Patch 1 is a fix for a bug in the way the TDP MMU issues TLB flushes
Patches 2-5 eliminate many unnecessary TLB flushes through better batching
  (a rough sketch of the batching idea follows this list)
Patches 6-12 remove the need for a vCPU pointer to make_spte
Patches 13-18 are small refactors in preparation for patch 19
Patch 19 implements in-place largepage promotion when disabling dirty logging
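As a sketch of the batching idea (again a toy, self-contained C model, not
the series code; all names are invented): the win comes from deferring the
remote TLB flush to the end of a zap pass instead of issuing one per zapped
entry, which matters a lot when every flush has to reach 96 vCPUs.

#include <stdio.h>

static unsigned long tlb_flushes;

/* Stand-in for a remote TLB flush (IPIs to every vCPU in the real case). */
static void flush_remote_tlbs(void)
{
	tlb_flushes++;
}

/* Unbatched: flush after every zapped entry. */
static void zap_range_flush_each(unsigned long nr_entries)
{
	for (unsigned long i = 0; i < nr_entries; i++)
		flush_remote_tlbs();	/* zap entry i, then flush */
}

/* Batched: zap everything first, then issue a single flush. */
static void zap_range_flush_once(unsigned long nr_entries)
{
	for (unsigned long i = 0; i < nr_entries; i++)
		;			/* zap entry i; defer the flush */
	flush_remote_tlbs();
}

int main(void)
{
	tlb_flushes = 0;
	zap_range_flush_each(1UL << 18);
	printf("per-entry flushing: %lu flushes\n", tlb_flushes);

	tlb_flushes = 0;
	zap_range_flush_once(1UL << 18);
	printf("batched flushing:   %lu flushes\n", tlb_flushes);
	return 0;
}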

Ben Gardon (19):
  KVM: x86/mmu: Fix TLB flush range when handling disconnected pt
  KVM: x86/mmu: Batch TLB flushes for a single zap
  KVM: x86/mmu: Factor flush and free up when zapping under MMU write
    lock
  KVM: x86/mmu: Yield while processing disconnected_sps
  KVM: x86/mmu: Remove redundant flushes when disabling dirty logging
  KVM: x86/mmu: Introduce vcpu_make_spte
  KVM: x86/mmu: Factor wrprot for nested PML out of make_spte
  KVM: x86/mmu: Factor mt_mask out of make_spte
  KVM: x86/mmu: Remove need for a vcpu from
    kvm_slot_page_track_is_active
  KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages
  KVM: x86/mmu: Factor shadow_zero_check out of make_spte
  KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
  KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  KVM: x86/mmu: Propagate memslot const qualifier
  KVM: x86/MMU: Refactor vmx_get_mt_mask
  KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
    on vcpu
  KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
  KVM: x86/mmu: Promote pages in-place when disabling dirty logging

 arch/x86/include/asm/kvm-x86-ops.h    |   1 +
 arch/x86/include/asm/kvm_host.h       |   2 +
 arch/x86/include/asm/kvm_page_track.h |   6 +-
 arch/x86/kvm/mmu/mmu.c                |  45 +++---
 arch/x86/kvm/mmu/mmu_internal.h       |   6 +-
 arch/x86/kvm/mmu/page_track.c         |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h        |   6 +-
 arch/x86/kvm/mmu/spte.c               |  43 +++--
 arch/x86/kvm/mmu/spte.h               |  17 +-
 arch/x86/kvm/mmu/tdp_mmu.c            | 217 +++++++++++++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.h            |   5 +-
 arch/x86/kvm/svm/svm.c                |   8 +
 arch/x86/kvm/vmx/vmx.c                |  40 +++--
 include/linux/kvm_host.h              |  10 +-
 virt/kvm/kvm_main.c                   |  12 +-
 15 files changed, 302 insertions(+), 124 deletions(-)

Comments

Ben Gardon Nov. 15, 2021, 9:24 p.m. UTC | #1
On Wed, Nov 10, 2021 at 2:30 PM Ben Gardon <bgardon@google.com> wrote:
>
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~2 seconds.

In a conversation with Sean today, he expressed interest in taking
over patches 2-4 from this series, as they conflict with another fix
he was working on.
I'll leave it to him to incorporate the feedback on these patches.
In the meantime, I've sent another iteration of patch 1 from this
series (a standalone bug fix) and will work on putting together
another version of patches 5-19.