[0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU

Message ID 20210611235701.3941724-1-dmatlack@google.com (mailing list archive)

David Matlack June 11, 2021, 11:56 p.m. UTC
This patch series adds support for the TDP MMU in the fast_page_fault
path, which enables certain write-protection and access tracking faults
to be handled without taking the KVM MMU lock. This series brings the
performance of these faults up to par with the legacy MMU.

Design
------

This series enables the existing fast_page_fault handler to operate
independent of whether the TDP MMU is enabled or not by abstracting out
the details behind a new lockless page walk API. I tried an alternative
design where the TDP MMU provided its own fast_page_fault handler and
there was shared helper code for modifying the PTE. However, I decided
against this approach because it forced me to duplicate the retry loop,
resulted in calls back and forth between mmu.c and tdp_mmu.c, and
passing around the RET_PF_* values got complicated fast.

Testing
-------

Setup:
 - Ran all tests on a Cascade Lake machine.
 - Ran all tests with kvm_intel.eptad=N, kvm_intel.pml=N, kvm.tdp_mmu=N.
 - Ran all tests with kvm_intel.eptad=N, kvm_intel.pml=N, kvm.tdp_mmu=Y.

Tests:
 - Ran all KVM selftests with default arguments
 - ./access_tracking_perf_test -v 4
 - ./access_tracking_perf_test -v 4 -o
 - ./access_tracking_perf_test -v 4 -s anonymous_thp
 - ./access_tracking_perf_test -v 4 -s anonymous_thp -o
 - ./access_tracking_perf_test -v 64
 - ./dirty_log_perf_test -v 4 -s anonymous_thp
 - ./dirty_log_perf_test -v 4 -s anonymous_thp -o
 - ./dirty_log_perf_test -v 4 -o
 - ./dirty_log_perf_test -v 64

For certain tests I also collected the fast_page_fault tracepoint to
verify it was being triggered as expected:

  perf record -e kvmmmu:fast_page_fault --filter "old_spte != 0" -- <test>

Performance Results
-------------------

To measure performance I ran dirty_log_perf_test and
access_tracking_perf_test with 64 vCPUs. For dirty_log_perf_test
performance is measured by "Iteration 2 dirty memory time", the time it
takes for all vCPUs to write to their memory after it has been
write-protected. For access_tracking_perf_test performance is measured
by "Writing to idle memory", the time it takes for all vCPUs to write to
their memory after it has been access-protected.

Both metrics improved by roughly 10x:

Metric                            | tdp_mmu=Y before   | tdp_mmu=Y after
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 3.545234984s       | 0.312197959s
Writing to idle memory            | 3.249645416s       | 0.298275545s

The TDP MMU is now on par with the legacy MMU:

Metric                            | tdp_mmu=N          | tdp_mmu=Y
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 0.300802793s       | 0.312197959s
Writing to idle memory            | 0.295591860s       | 0.298275545s

David Matlack (8):
  KVM: x86/mmu: Refactor is_tdp_mmu_root()
  KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  KVM: x86/mmu: Common API for lockless shadow page walks
  KVM: x86/mmu: Also record spteps in shadow_page_walk
  KVM: x86/mmu: fast_page_fault support for the TDP MMU
  KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  KVM: selftests: Introduce access_tracking_perf_test

 arch/x86/kvm/mmu/mmu.c                        | 159 +++----
 arch/x86/kvm/mmu/mmu_internal.h               |  18 +
 arch/x86/kvm/mmu/mmutrace.h                   |   3 +
 arch/x86/kvm/mmu/tdp_mmu.c                    |  37 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |  14 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../selftests/kvm/access_tracking_perf_test.c | 419 ++++++++++++++++++
 .../selftests/kvm/dirty_log_perf_test.c       |   1 +
 9 files changed, 559 insertions(+), 96 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c

Comments

Paolo Bonzini June 14, 2021, 9:54 a.m. UTC | #1
On 12/06/21 01:56, David Matlack wrote:
> This patch series adds support for the TDP MMU in the fast_page_fault
> path, which enables certain write-protection and access tracking faults
> to be handled without taking the KVM MMU lock. This series brings the
> performance of these faults up to par with the legacy MMU.

Hi David,

I have one very basic question: is the speedup due to lock contention, 
or to cacheline bouncing, or something else altogether? In other words, 
what do the profiles look like before vs. after these patches?

Thanks,

Paolo
David Matlack June 14, 2021, 9:08 p.m. UTC | #2
On Mon, Jun 14, 2021 at 11:54:59AM +0200, Paolo Bonzini wrote:
> On 12/06/21 01:56, David Matlack wrote:
> > This patch series adds support for the TDP MMU in the fast_page_fault
> > path, which enables certain write-protection and access tracking faults
> > to be handled without taking the KVM MMU lock. This series brings the
> > performance of these faults up to par with the legacy MMU.
> 
> Hi David,
> 
> I have one very basic question: is the speedup due to lock contention, or to
> cacheline bouncing, or something else altogether? In other words, what do
> the profiles look like before vs. after these patches?

The speed up comes from a combination of:
 - Less time spent in kvm_vcpu_gfn_to_memslot.
 - Less lock contention on the MMU lock in read mode.

Before:

  Overhead  Symbol
-   45.59%  [k] kvm_vcpu_gfn_to_memslot
   - 45.57% kvm_vcpu_gfn_to_memslot
      - 29.25% kvm_page_track_is_active
         + 15.90% direct_page_fault
         + 13.35% mmu_need_write_protect
      + 9.10% kvm_mmu_hugepage_adjust
      + 7.20% try_async_pf
+   18.16%  [k] _raw_read_lock
+   10.57%  [k] direct_page_fault
+    8.77%  [k] handle_changed_spte_dirty_log
+    4.65%  [k] mark_page_dirty_in_slot
     1.62%  [.] run_test
+    1.35%  [k] x86_virt_spec_ctrl
+    1.18%  [k] try_grab_compound_head
[...]

After:

  Overhead  Symbol
+   26.23%  [k] x86_virt_spec_ctrl
+   15.93%  [k] vmx_vmexit
+    6.33%  [k] vmx_vcpu_run
+    4.31%  [k] vcpu_enter_guest
+    3.71%  [k] tdp_iter_next
+    3.47%  [k] __vmx_vcpu_run
+    2.92%  [k] kvm_vcpu_gfn_to_memslot
+    2.71%  [k] vcpu_run
+    2.71%  [k] fast_page_fault
+    2.51%  [k] kvm_vcpu_mark_page_dirty

(Both profiles were captured during "Iteration 2 dirty memory" of
dirty_log_perf_test.)

Related to the kvm_vcpu_gfn_to_memslot overhead: I actually have a set of
patches from Ben I am planning to send soon that will reduce the number of
redundant gfn-to-memslot lookups in the page fault path.

> 
> Thanks,
> 
> Paolo
>
Paolo Bonzini June 15, 2021, 7:16 a.m. UTC | #3
On 14/06/21 23:08, David Matlack wrote:
> I actually have a set of
> patches from Ben I am planning to send soon that will reduce the number of
> redundant gfn-to-memslot lookups in the page fault path.

That seems to be a possible 5.14 candidate, while this series is 
probably a bit too much for now.

Paolo
David Matlack June 16, 2021, 7:27 p.m. UTC | #4
On Tue, Jun 15, 2021 at 09:16:00AM +0200, Paolo Bonzini wrote:
> On 14/06/21 23:08, David Matlack wrote:
> > I actually have a set of
> > patches from Ben I am planning to send soon that will reduce the number of
> > redundant gfn-to-memslot lookups in the page fault path.
> 
> That seems to be a possible 5.14 candidate, while this series is probably a
> bit too much for now.

Thanks for the feedback. I am not in a rush to get either series into
5.14 so that sounds fine with me. Here is how I am planning to proceed:

1. Send a new series with the cleanups to is_tdp_mmu_root Sean suggested
   in patch 1/8 [1].
2. Send v2 of the TDP MMU Fast Page Fault series without patch 1/8.
3. Send out the memslot lookup optimization series.

Does that sound reasonable to you? Do you have any reservations with
taking (2) before (3)?

[1] https://lore.kernel.org/kvm/YMepDK40DLkD4DSy@google.com/

> 
> Paolo
>
Paolo Bonzini June 16, 2021, 7:31 p.m. UTC | #5
On 16/06/21 21:27, David Matlack wrote:
>>> I actually have a set of
>>> patches from Ben I am planning to send soon that will reduce the number of
>>> redundant gfn-to-memslot lookups in the page fault path.
>> That seems to be a possible 5.14 candidate, while this series is probably a
>> bit too much for now.
> Thanks for the feedback. I am not in a rush to get either series into
> 5.14 so that sounds fine with me. Here is how I am planning to proceed:
> 
> 1. Send a new series with the cleanups to is_tdp_mmu_root Sean suggested
>     in patch 1/8 [1].
> 2. Send v2 of the TDP MMU Fast Page Fault series without patch 1/8.
> 3. Send out the memslot lookup optimization series.
> 
> Does that sound reasonable to you? Do you have any reservations with
> taking (2) before (3)?
> 
> [1] https://lore.kernel.org/kvm/YMepDK40DLkD4DSy@google.com/

They all seem reasonably independent, so use the order that is easier 
for you.

Paolo