[00/24] Allow parallel page faults with TDP MMU

Message ID: 20210112181041.356734-1-bgardon@google.com

Ben Gardon Jan. 12, 2021, 6:10 p.m. UTC
The TDP MMU was implemented to simplify and improve the performance of
KVM's memory management on modern hardware with TDP (EPT / NPT). To build
on the existing performance improvements of the TDP MMU, add the ability
to handle vCPU page faults in parallel. In the current implementation,
vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
largest source of MMU lock contention on VMs with many vCPUs. This
contention, and the resulting page fault latency, can soft-lock guests
and degrade performance. Handling page faults in parallel is especially
useful when booting VMs, enabling dirty logging, and handling demand
paging. In all these cases, vCPUs are constantly incurring page faults on
each newly accessed page.

Broadly, the following changes were required to allow parallel page
faults:
-- Contention detection and yielding added to rwlocks to bring them up to
   feature parity with spin locks, at least as far as the use of the MMU
   lock is concerned.
-- TDP MMU page table memory is protected with RCU and freed in RCU
   callbacks to allow multiple threads to operate on that memory
   concurrently.
-- When the TDP MMU is enabled, an rwlock is used instead of a spin lock on
   x86. This allows the page fault handlers to acquire the MMU lock in read
   mode and handle page faults in parallel, while other operations maintain
   exclusive use of the lock by acquiring it in write mode.
-- An additional lock is added to protect some data structures needed by
   the page fault handlers, for relatively infrequent operations.
-- The page fault handler is modified to use atomic cmpxchgs to set SPTEs,
   and some page fault handler operations are modified slightly to work
   concurrently with other threads (a rough sketch of how these pieces fit
   together follows this list).
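
To make these pieces a bit more concrete, here is a minimal sketch of a
read-mode fault path built from the ingredients above. It is only an
illustration: tdp_mmu_try_fix_fault(), tdp_mmu_walk_to_leaf(), and
make_present_spte() are hypothetical stand-ins, the lock placement follows
the "Move mmu_lock to struct kvm_arch" patch, and the real handlers in
arch/x86/kvm/mmu/tdp_mmu.c are the authoritative implementation.

#include <linux/kvm_host.h>

/* Hypothetical helpers standing in for the TDP MMU iterator and SPTE code. */
u64 *tdp_mmu_walk_to_leaf(struct kvm_vcpu *vcpu, gpa_t gpa);
u64 make_present_spte(gpa_t gpa, kvm_pfn_t pfn);

/*
 * Illustrative only: take the MMU lock in read mode so many vCPUs can
 * fault in parallel, walk the paging structure under RCU (page table
 * pages are only freed after a grace period), and install the new SPTE
 * with a cmpxchg so that racing vCPUs cannot clobber each other's updates.
 */
static bool tdp_mmu_try_fix_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
                                  kvm_pfn_t pfn)
{
        struct kvm *kvm = vcpu->kvm;
        u64 *sptep, old_spte, new_spte;
        bool fixed = false;

        read_lock(&kvm->arch.mmu_lock);
        rcu_read_lock();

        sptep = tdp_mmu_walk_to_leaf(vcpu, gpa);
        old_spte = READ_ONCE(*sptep);
        new_spte = make_present_spte(gpa, pfn);

        /*
         * Install the SPTE only if no other thread changed it since it
         * was read; on failure the caller simply retries the fault.
         */
        if (cmpxchg64(sptep, old_spte, new_spte) == old_spte)
                fixed = true;

        rcu_read_unlock();
        read_unlock(&kvm->arch.mmu_lock);

        return fixed;
}

The real fault path also has to handle failed cmpxchgs against frozen
(removed) SPTEs and flush TLBs after zaps, which is what several of the
later patches in the series deal with.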

This series also contains a few bug fixes and optimizations related to the
above, but they are not strictly part of enabling parallel page fault
handling.

Performance testing:
The KVM selftest dirty_log_perf_test demonstrates the performance
improvements associated with this patch series. dirty_log_perf_test was run
on a two-socket Indus Skylake machine, with a 96 vCPU VM.
5 get-dirty-log iterations were run. Each test was run 3 times and the
results averaged. The test was conducted with 3 different variables:

1. Overlapping versus partitioned memory
With overlapping memory, vCPUs are more likely to incur retries while
handling parallel page faults, so the TDP MMU with parallel page faults is
expected to fare worst in this configuration. Memory partitioned between
vCPUs is the best case for parallel page faults with the TDP MMU, as it
should minimize contention and retries.
When running with partitioned memory, 3G was allocated for each vCPU's data
region; when running with overlapping memory accesses, a total of 6G was
allocated for the VM's data region. This meant that the overlapping VM was
much smaller overall (6G versus 3G x 96 vCPUs = 288G), but each vCPU had
more memory to access (the shared 6G rather than a private 3G). Since the
VMs were very different in size, the results cannot be reliably compared
across the two configurations. The VM sizes were chosen to balance test
runtime and repeatability of results. The option to overlap memory accesses
will be added to dirty_log_perf_test in a (near-)future series.

2. With this patch set applied versus without
In these tests the series was applied on commit:
9f1abbe97c08 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
That commit was also used as the baseline.

3. TDP MMU enabled versus disabled
This is primarily included to ensure that this series does not regress
performance with the TDP MMU disabled.

Does this series improve performance with the TDP MMU enabled?

Baseline, TDP MMU enabled, partitioned accesses:
Populate memory time (s)		110.193
Enabling dirty logging time (s)		4.829
Dirty memory time (s)			3.949
Get dirty log time (s)			0.822
Disabling dirty logging time (s)	2.995
Parallel PFs series, TDP MMU enabled, partitioned accesses:
Populate memory time (s)		16.112
Enabling dirty logging time (s)		7.057
Dirty memory time (s)			0.275
Get dirty log time (s)			5.468
Disabling dirty logging time (s)	3.088

This scenario demonstrates the big win in this series: an 85% reduction in
the time taken to populate memory! Note that the time taken to dirty memory
is much shorter, and the time taken to get the dirty log is higher, with
this series.
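(For reference: populate time drops from 110.193 s to 16.112 s, and
(110.193 - 16.112) / 110.193 ≈ 0.854, i.e. roughly an 85% reduction.)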

Baseline, TDP MMU enabled, overlapping accesses:
Populate memory time (s)		117.31
Enabling dirty logging time (s)		0.191
Dirty memory time (s)			0.193
Get dirty log time (s)			2.33
Disabling dirty logging time (s)	0.059
Parallel PFs series, TDP MMU enabled, overlapping accesses:
Populate memory time (s)		141.155
Enabling dirty logging time (s)		0.245
Dirty memory time (s)			0.236
Get dirty log time (s)			2.35
Disabling dirty logging time (s)	0.075

With overlapping accesses, we can see that this parallel page fault series
actually reduces performance when populating memory. In profiling, it
appeared that most of the time was spent in get_user_pages, so it's
possible the extra concurrency hit the main MM subsystem harder, creating
contention there.

Does this series degrade performance with the TDP MMU disabled?

Baseline, TDP MMU disabled, partitioned accesses:
Populate memory time (s)		110.193
Enabling dirty logging time (s)		4.829
Dirty memory time (s)			3.949
Get dirty log time (s)			0.822
Disabling dirty logging time (s)	2.995
Parallel PFs series, TDP MMU disabled, partitioned accesses:
Populate memory time (s)		110.917
Enabling dirty logging time (s)		5.196
Dirty memory time (s)			4.559
Get dirty log time (s)			0.879
Disabling dirty logging time (s)	3.278

Here we can see that the parallel PFs series appears to have made enabling
and disabling dirty logging, and dirtying memory, slightly slower. It's
possible that this is a result of the additional checks around MMU lock
acquisition.

Baseline, TDP MMU disabled, overlapping accesses:
Populate memory time (s)		103.115
Enabling dirty logging time (s)		0.222
Dirty memory time (s)			0.189
Get dirty log time (s)			2.341
Disabling dirty logging time (s)	0.126
Parallel PFs series, TDP MMU disabled, overlapping accesses:
Populate memory time (s)		85.392
Enabling dirty logging time (s)		0.224
Dirty memory time (s)			0.201
Get dirty log time (s)			2.363
Disabling dirty logging time (s)	0.131

From the above results we can see that, with the TDP MMU disabled, the
parallel PFs series only had a significant effect on the memory population
time, and only with overlapping accesses. It is not currently known what in
this series caused that improvement.

Correctness testing:
The following tests were performed with an SMP kernel and DBX kernel on an
Intel Skylake machine. The tests were run both with and without the TDP
MMU enabled.
-- This series introduces no new failures in kvm-unit-tests:
   SMP + no TDP MMU:  no new failures
   SMP + TDP MMU:     no new failures
   DBX + no TDP MMU:  no new failures
   DBX + TDP MMU:     no new failures
-- All KVM selftests behave as expected:
   SMP + no TDP MMU:  all pass except ./x86_64/vmx_preemption_timer_test
   SMP + TDP MMU:     all pass except ./x86_64/vmx_preemption_timer_test
   (./x86_64/vmx_preemption_timer_test also fails without this patch set,
   both with the TDP MMU on and off.)
   DBX + no TDP MMU:  all pass
   DBX + TDP MMU:     all pass
-- A VM running Debian 9 can be booted and all of its memory accessed:
   SMP + no TDP MMU:  works
   SMP + TDP MMU:     works
   DBX + no TDP MMU:  works
   DBX + TDP MMU:     works
Cross-compilation was also checked for PowerPC and ARM64.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Ben Gardon (24):
  locking/rwlocks: Add contention detection for rwlocks
  sched: Add needbreak for rwlocks
  sched: Add cond_resched_rwlock
  kvm: x86/mmu: change TDP MMU yield function returns to match
    cond_resched
  kvm: x86/mmu: Fix yielding in TDP MMU
  kvm: x86/mmu: Skip no-op changes in TDP MMU functions
  kvm: x86/mmu: Add comment on __tdp_mmu_set_spte
  kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
  kvm: x86/mmu: Factor out handle disconnected pt
  kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section
  kvm: x86/kvm: RCU dereference tdp mmu page table links
  kvm: x86/mmu: Only free tdp_mmu pages after a grace period
  kvm: mmu: Wrap mmu_lock lock / unlock in a function
  kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  kvm: mmu: Wrap mmu_lock assertions
  kvm: mmu: Move mmu_lock to struct kvm_arch
  kvm: x86/mmu: Use an rwlock for the x86 TDP MMU
  kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  kvm: x86/mmu: Add atomic option for setting SPTEs
  kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  kvm: x86/mmu: Freeze SPTEs in disconnected pages
  kvm: x86/mmu: Allow parallel page faults for the TDP MMU

 Documentation/virt/kvm/locking.rst       |   2 +-
 arch/arm64/include/asm/kvm_host.h        |   2 +
 arch/arm64/kvm/arm.c                     |   2 +
 arch/arm64/kvm/mmu.c                     |  40 +-
 arch/mips/include/asm/kvm_host.h         |   2 +
 arch/mips/kvm/mips.c                     |  10 +-
 arch/mips/kvm/mmu.c                      |  20 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   7 +-
 arch/powerpc/include/asm/kvm_host.h      |   2 +
 arch/powerpc/kvm/book3s_64_mmu_host.c    |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  12 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  32 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |   4 +-
 arch/powerpc/kvm/book3s_hv.c             |   8 +-
 arch/powerpc/kvm/book3s_hv_nested.c      |  59 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  14 +-
 arch/powerpc/kvm/book3s_mmu_hpte.c       |  10 +-
 arch/powerpc/kvm/e500_mmu_host.c         |   6 +-
 arch/powerpc/kvm/powerpc.c               |   2 +
 arch/s390/include/asm/kvm_host.h         |   2 +
 arch/s390/kvm/kvm-s390.c                 |   2 +
 arch/x86/include/asm/kvm_host.h          |  23 +
 arch/x86/kvm/mmu/mmu.c                   | 189 ++++++--
 arch/x86/kvm/mmu/mmu_internal.h          |  16 +-
 arch/x86/kvm/mmu/page_track.c            |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |   8 +-
 arch/x86/kvm/mmu/spte.h                  |  16 +-
 arch/x86/kvm/mmu/tdp_iter.c              |   6 +-
 arch/x86/kvm/mmu/tdp_mmu.c               | 540 +++++++++++++++++++----
 arch/x86/kvm/x86.c                       |   4 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c         |  12 +-
 include/asm-generic/qrwlock.h            |  24 +-
 include/linux/kvm_host.h                 |   7 +-
 include/linux/rwlock.h                   |   7 +
 include/linux/sched.h                    |  29 ++
 kernel/sched/core.c                      |  40 ++
 virt/kvm/dirty_ring.c                    |   4 +-
 virt/kvm/kvm_main.c                      |  58 ++-
 38 files changed, 938 insertions(+), 295 deletions(-)