
[v2,00/28] Allow parallel MMU operations with TDP MMU

Message ID 20210202185734.1680553-1-bgardon@google.com (mailing list archive)

Message

Ben Gardon Feb. 2, 2021, 6:57 p.m. UTC
The TDP MMU was implemented to simplify and improve the performance of
KVM's memory management on modern hardware with TDP (EPT / NPT). To build
on the existing performance improvements of the TDP MMU, add the ability
to handle vCPU page faults, enabling and disabling dirty logging, and
removing mappings, in parallel. In the current implementation,
vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
largest source of MMU lock contention on VMs with many vCPUs. This
contention, and the resulting page fault latency, can soft-lock guests
and degrade performance. Handling page faults in parallel is especially
useful when booting VMs, enabling dirty logging, and handling demand
paging. In all these cases vCPUs are constantly incurring page faults on
each new page accessed.

Broadly, the following changes were required to allow parallel page
faults (and other MMU operations):
-- Contention detection and yielding are added to rwlocks to bring them up
   to feature parity with spin locks, at least as far as the use of the
   MMU lock is concerned.
-- TDP MMU page table memory is protected with RCU and freed in RCU
   callbacks to allow multiple threads to operate on that memory
   concurrently.
-- The MMU lock is changed to an rwlock on x86. This allows the page
   fault handlers to acquire the MMU lock in read mode and handle page
   faults in parallel, while other operations maintain exclusive use of
   the lock by acquiring it in write mode.
-- An additional lock is added to protect some data structures needed by
   the page fault handlers, for relatively infrequent operations.
-- The page fault handler is modified to use atomic cmpxchgs to set SPTEs,
   and some page fault handler operations are modified slightly to work
   concurrently with other threads (see the sketch below).
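
To make the cmpxchg and read-lock points above concrete, here is a minimal
sketch of the pattern. It is illustrative only: the function names and the
flat SPTE array are made up for the example, and the TLB flushing and
removed-page-table bookkeeping the series adds are omitted. The yield helper
is assumed to be the read-side variant, cond_resched_rwlock_read(), added by
the sched patches in this series.

#include <linux/atomic.h>    /* cmpxchg64() */
#include <linux/compiler.h>  /* READ_ONCE() */
#include <linux/sched.h>     /* cond_resched_rwlock_read() */
#include <linux/spinlock.h>  /* rwlock_t, read_lock(), read_unlock() */
#include <linux/types.h>

/*
 * Sketch: install new_spte only if the SPTE still holds the value read
 * earlier.  cmpxchg64() returns the value that was actually present, so
 * a mismatch means another vCPU's fault handler changed the SPTE first
 * and the caller must re-read and retry (or let the guest re-fault).
 */
static bool example_set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
{
        return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}

/*
 * Sketch: a long walk performed with the MMU lock held only for read.
 * Each SPTE is cleared with a cmpxchg so concurrent threads see either
 * the old or the new value, and the walk periodically yields the rwlock
 * when it is contended or a reschedule is needed, mirroring what
 * cond_resched_lock() already provides for spinlocks.
 */
static void example_clear_sptes(rwlock_t *mmu_lock, u64 *sptes, int nr)
{
        int i;

        read_lock(mmu_lock);
        for (i = 0; i < nr; i++) {
                u64 old_spte = READ_ONCE(sptes[i]);

                /* A lost race is simply skipped in this sketch. */
                example_set_spte_atomic(&sptes[i], old_spte, 0);

                cond_resched_rwlock_read(mmu_lock);
        }
        read_unlock(mmu_lock);
}

Losing a cmpxchg race is always safe in this scheme: the value that won is
fully valid, so the loser re-reads the SPTE and retries, or simply returns
and lets the vCPU take the fault again.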

This series also contains a few bug fixes and optimizations, related to
the above, but not strictly part of enabling parallel page fault handling.

Correctness testing:
The following tests were performed with an SMP kernel and DBX kernel on an
Intel Skylake machine. The tests were run both with and without the TDP
MMU enabled.
-- This series introduces no new failures in kvm-unit-tests
SMP + no TDP MMU: no new failures
SMP + TDP MMU: no new failures
DBX + no TDP MMU: no new failures
DBX + TDP MMU: no new failures
-- All KVM selftests behave as expected
SMP + no TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
SMP + TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
(./x86_64/vmx_preemption_timer_test also fails without this patch set,
both with the TDP MMU on and off.)
DBX + no TDP MMU: all pass
DBX + TDP MMU: all pass
-- A VM can be booted running Debian 9 and all memory accessed
SMP + no TDP MMU: works
SMP + TDP MMU: works
DBX + no TDP MMU: works
DBX + TDP MMU: works

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Changelog v1 -> v2:
- Removed the MMU lock union and the use of a spinlock when the TDP MMU is disabled
- Merged RCU commits
- Extended additional MMU operations to operate in parallel
- Amended dirty log perf test to cover newly parallelized code paths
- Misc refactorings (see changelogs for individual commits)
- Big thanks to Sean and Paolo for their thorough review of v1

Ben Gardon (28):
  KVM: x86/mmu: change TDP MMU yield function returns to match
    cond_resched
  KVM: x86/mmu: Add comment on __tdp_mmu_set_spte
  KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory
  KVM: x86/mmu: Factor out handling of removed page tables
  locking/rwlocks: Add contention detection for rwlocks
  sched: Add needbreak for rwlocks
  sched: Add cond_resched_rwlock
  KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages
  KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
  KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched
  KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn
  KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
  KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed
  KVM: x86/mmu: Skip no-op changes in TDP MMU functions
  KVM: x86/mmu: Clear dirtied pages mask bit before early break
  KVM: x86/mmu: Protect TDP MMU page table memory with RCU
  KVM: x86/mmu: Use an rwlock for the x86 MMU
  KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages
  KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  KVM: x86/mmu: Mark SPTEs in disconnected pages as removed
  KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
  KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read
    lock
  KVM: selftests: Add backing src parameter to dirty_log_perf_test
  KVM: selftests: Disable dirty logging with vCPUs running

 arch/x86/include/asm/kvm_host.h               |  15 +
 arch/x86/kvm/mmu/mmu.c                        | 120 +--
 arch/x86/kvm/mmu/mmu_internal.h               |   9 +-
 arch/x86/kvm/mmu/page_track.c                 |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |   8 +-
 arch/x86/kvm/mmu/spte.h                       |  21 +-
 arch/x86/kvm/mmu/tdp_iter.c                   |  46 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  21 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 741 ++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +-
 arch/x86/kvm/x86.c                            |   4 +-
 include/asm-generic/qrwlock.h                 |  24 +-
 include/linux/kvm_host.h                      |   5 +
 include/linux/rwlock.h                        |   7 +
 include/linux/sched.h                         |  29 +
 kernel/sched/core.c                           |  40 +
 .../selftests/kvm/demand_paging_test.c        |   3 +-
 .../selftests/kvm/dirty_log_perf_test.c       |  25 +-
 .../testing/selftests/kvm/include/kvm_util.h  |   6 -
 .../selftests/kvm/include/perf_test_util.h    |   3 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 .../selftests/kvm/lib/perf_test_util.c        |   6 +-
 tools/testing/selftests/kvm/lib/test_util.c   |  29 +
 virt/kvm/dirty_ring.c                         |  10 +
 virt/kvm/kvm_main.c                           |  46 +-
 25 files changed, 963 insertions(+), 282 deletions(-)

Comments

Paolo Bonzini Feb. 3, 2021, 11 a.m. UTC | #1
On 02/02/21 19:57, Ben Gardon wrote:
> The TDP MMU was implemented to simplify and improve the performance of
> KVM's memory management on modern hardware with TDP (EPT / NPT). To build
> on the existing performance improvements of the TDP MMU, add the ability
> to handle vCPU page faults, enabling and disabling dirty logging, and
> removing mappings, in parallel. In the current implementation,
> vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
> largest source of MMU lock contention on VMs with many vCPUs. This
> contention, and the resulting page fault latency, can soft-lock guests
> and degrade performance. Handling page faults in parallel is especially
> useful when booting VMs, enabling dirty logging, and handling demand
> paging. In all these cases vCPUs are constantly incurring page faults on
> each new page accessed.
> 
> Broadly, the following changes were required to allow parallel page
> faults (and other MMU operations):
> -- Contention detection and yielding added to rwlocks to bring them up to
>     feature parity with spin locks, at least as far as the use of the MMU
>     lock is concerned.
> -- TDP MMU page table memory is protected with RCU and freed in RCU
>     callbacks to allow multiple threads to operate on that memory
>     concurrently.
> -- The MMU lock was changed to an rwlock on x86. This allows the page
>     fault handlers to acquire the MMU lock in read mode and handle page
>     faults in parallel, and other operations to maintain exclusive use of
>     the lock by acquiring it in write mode.
> -- An additional lock is added to protect some data structures needed by
>     the page fault handlers, for relatively infrequent operations.
> -- The page fault handler is modified to use atomic cmpxchgs to set SPTEs
>     and some page fault handler operations are modified slightly to work
>     concurrently with other threads.
> 
> This series also contains a few bug fixes and optimizations, related to
> the above, but not strictly part of enabling parallel page fault handling.
> 
> Correctness testing:
> The following tests were performed with an SMP kernel and DBX kernel on an
> Intel Skylake machine. The tests were run both with and without the TDP
> MMU enabled.
> -- This series introduces no new failures in kvm-unit-tests
> SMP + no TDP MMU: no new failures
> SMP + TDP MMU: no new failures
> DBX + no TDP MMU: no new failures
> DBX + TDP MMU: no new failures

What's DBX?  Lockdep etc.?

> -- All KVM selftests behave as expected
> SMP + no TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
> SMP + TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
> (./x86_64/vmx_preemption_timer_test also fails without this patch set,
> both with the TDP MMU on and off.)

Yes, it's flaky.  It depends on your host.

> DBX + no TDP MMU: all pass
> DBX + TDP MMU: all pass
> -- A VM can be booted running Debian 9 and all memory accessed
> SMP + no TDP MMU: works
> SMP + TDP MMU: works
> DBX + no TDP MMU: works
> DBX + TDP MMU: works
> 
> This series can be viewed in Gerrit at:
> https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Looks good!  I'll wait for a few days of reviews, but I'd like to queue 
this for 5.12 and I plan to make it the default in 5.13 or 5.12-rc 
(depending on when I can ask Red Hat QE to give it a shake).

It also needs more documentation though.  I'll do that myself based on 
your KVM Forum talk so that I can teach myself more of it.

Paolo
Sean Christopherson Feb. 3, 2021, 5:54 p.m. UTC | #2
On Wed, Feb 03, 2021, Paolo Bonzini wrote:
> Looks good!  I'll wait for a few days of reviews,

I guess I know what I'm doing this afternoon :-)

> but I'd like to queue this for 5.12 and I plan to make it the default in 5.13
> or 5.12-rc (depending on when I can ask Red Hat QE to give it a shake).

Hmm, given that kvm/queue doesn't seem to get widespread testing, I think it
should be enabled by default in rc1 for whatever kernel it targets.

Would it be too heinous to enable it by default in 5.12-rc1, knowing full well
that there's a good possibility it would get reverted?
Paolo Bonzini Feb. 3, 2021, 6:13 p.m. UTC | #3
On 03/02/21 18:54, Sean Christopherson wrote:
> On Wed, Feb 03, 2021, Paolo Bonzini wrote:
>> Looks good!  I'll wait for a few days of reviews,
> 
> I guess I know what I'm doing this afternoon :-)
> 
>> but I'd like to queue this for 5.12 and I plan to make it the default in 5.13
>> or 5.12-rc (depending on when I can ask Red Hat QE to give it a shake).
> 
> Hmm, given that kvm/queue doesn't seem to get widespread testing, I think it
> should be enabled by default in rc1 for whatever kernel it targets.
> 
> Would it be too heinous to enable it by default in 5.12-rc1, knowing full well
> that there's a good possibility it would get reverted?

Absolutely not.  However, to clarify my plan:

- what is now kvm/queue and has been reviewed will graduate to kvm/next 
in a couple of days, and then to 5.12-rc1.  Ben's patches are already in 
kvm/queue, but there's no problem in waiting another week before moving 
them to kvm/next because it's not enabled by default.  (Right now even 
CET is in kvm/queue, but it will not move to kvm/next until bare metal 
support is in).

- if this will not have been tested by Red Hat QE by say 5.12-rc3, I 
would enable it in kvm/next instead, and at that point the target would 
become the 5.13 merge window (and release).

Paolo