[v8,00/11] KVM: x86/mmu: Age sptes locklessly


Message ID

20241105184333.2305744-1-jthoughton@google.com (mailing list archive)

Headers

Date: Tue,  5 Nov 2024 18:43:22 +0000
Precedence: bulk
Mime-Version: 1.0
Message-ID: <20241105184333.2305744-1-jthoughton@google.com>
Subject: [PATCH v8 00/11] KVM: x86/mmu: Age sptes locklessly
From: James Houghton <jthoughton@google.com>
To: Sean Christopherson <seanjc@google.com>,
 Paolo Bonzini <pbonzini@redhat.com>
Cc: David Matlack <dmatlack@google.com>, David Rientjes <rientjes@google.com>,
	James Houghton <jthoughton@google.com>, Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>, Wei Xu <weixugc@google.com>,
 Yu Zhao <yuzhao@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

Message

James Houghton Nov. 5, 2024, 6:43 p.m. UTC

Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
us to reduce the time/CPU it takes to do aging and the performance
impact on the vCPUs while we are aging.
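
As a rough illustration of what "lockless" aging means here, the sketch
below clears an accessed bit with a single atomic read-modify-write
instead of taking a lock. It is a userspace approximation using C11
atomics, not the KVM code: DEMO_ACCESSED_BIT, demo_age_pte() and
demo_test_age_pte() are hypothetical names, and the bit position is an
assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessed-bit position; real SPTE layouts differ. */
#define DEMO_ACCESSED_BIT (UINT64_C(1) << 5)

/*
 * Atomically clear the accessed bit and report whether it was set.
 * No lock is taken: the read-modify-write is one atomic operation, so
 * concurrent updates to other bits of the entry are not lost.
 */
static bool demo_age_pte(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_fetch_and(pte, ~DEMO_ACCESSED_BIT);

	return (old & DEMO_ACCESSED_BIT) != 0;
}

/* Read-only check of the accessed bit; the entry is left unchanged. */
static bool demo_test_age_pte(_Atomic uint64_t *pte)
{
	return (atomic_load(pte) & DEMO_ACCESSED_BIT) != 0;
}
```

Calling demo_age_pte() on an entry with the bit set returns true and
leaves the bit clear; a second call returns false.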

The final patch in this series modifies access_tracking_perf_test to
age using MGLRU. There is a mode (-p) where it will age while the vCPUs
are faulting memory in. Here are some results with that mode:

TDP MMU disabled, no optimization:
$ ./access_tracking_perf_test -l -r /dev/cgroup/memory -v 64 -p
lru_gen avg pass duration     : 15.954388970s, (passes:1, total:15.954388970s)

TDP MMU disabled, lockless:
$ ./access_tracking_perf_test -l -r /dev/cgroup/memory -v 64 -p
lru_gen avg pass duration     : 0.527091929s, (passes:35, total:18.448217547s)

The vCPU time difference between these runs with the shadow MMU varies
quite a lot, and there doesn't seem to be a notable improvement with this
particular test.
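
For reference, the per-pass aging durations quoted above work out to
roughly a 30x reduction (15.95s vs. 0.53s per pass). The helper names
below are hypothetical; this is just the arithmetic written out so it
can be rerun with other numbers.

```c
#include <assert.h>

/* Average pass duration in seconds: total time divided by pass count. */
static double avg_pass_seconds(double total_seconds, unsigned int passes)
{
	assert(passes > 0);
	return total_seconds / passes;
}

/* Ratio of the baseline per-pass time to the lockless per-pass time. */
static double pass_speedup(double baseline_avg, double lockless_avg)
{
	assert(lockless_avg > 0.0);
	return baseline_avg / lockless_avg;
}
```

With the totals above, avg_pass_seconds(18.448217547, 35) is about
0.527s, and pass_speedup() of the two averages is a little over 30.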

There are some more results with the TDP MMU from v4[4].

I have also tested with Sean's mmu_stress_test changes[1].

Note: the new MGLRU mode for access_tracking_perf_test will verify that
aging is functional. It will only be functional with the MGLRU patches
that have been sent to Andrew separately[8].

=== Previous Versions ===

Since v7[7]:
 - Dropped MGLRU changes (Andrew has queued them[8]).
 - Dropped DAMON cleanup (Andrew has queued this[9]).
 - Dropped MMU notifier changes completely.
 - Made shadow MMU aging *always* lockless, not just lockless when the
   now-removed "fast_only" clear notifier was used.
 - Given that the MGLRU changes no longer introduce a new MGLRU
   capability, drop the new capability check from the selftest.
 - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
   for marking pages as dirty.

Since v6[6]:
 - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
   changes.
 - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
 - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
   DavidM and Sean).
 - Improved new MMU notifier documentation (thanks DavidH).
 - Dropped arm64 locking change.
 - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
   Sean).
 - Added some R-bys and A-bys.
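
To illustrate the "no retry on CAS failure" item above: when aging has
to rewrite several bits of an entry at once, a compare-and-swap is
attempted a single time and a lost race is simply tolerated. The sketch
below is a userspace approximation with C11 atomics, not the TDP MMU
code; the bit layout and all demo_* names are assumptions made for the
example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 3 bits are permissions, saved at bits 52-54. */
#define DEMO_PERM_MASK   UINT64_C(0x7)
#define DEMO_SAVED_SHIFT 52

/* Build the "access-tracked" value: stash the permissions, then clear them. */
static uint64_t demo_mark_for_access_track(uint64_t pte)
{
	uint64_t saved = (pte & DEMO_PERM_MASK) << DEMO_SAVED_SHIFT;

	return (pte & ~DEMO_PERM_MASK) | saved;
}

/*
 * Try once to age the entry. Returns true if the entry looked young.
 * If the swap fails, another thread changed the entry concurrently;
 * the caller moves on instead of looping.
 */
static bool demo_age_pte_once(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);
	uint64_t marked;

	if (!(old & DEMO_PERM_MASK))
		return false;	/* already access-tracked, i.e. old */

	marked = demo_mark_for_access_track(old);

	/* Single attempt; the result is deliberately ignored. */
	(void)atomic_compare_exchange_strong(pte, &old, marked);
	return true;
}
```

Reporting "young" even when the swap loses is a choice made for this
sketch: a concurrent modification suggests the entry is in active use,
so a later aging pass can pick it up.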

Since v5[5]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
   notifiers should be done.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[4]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Add kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fix locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fix WARN_ON and other cleanup for the arm64 locking changes
   (thanks Oliver).

Since v3[3]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check that range->lockless and range->on_lock are
   never both provided for the memslot walk.
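
The last item can be pictured roughly as follows. This is only a sketch
of the invariant, not the code in kvm_main.c: struct demo_range and the
demo_* helpers are hypothetical stand-ins, and the real check may be
expressed differently (e.g. as a WARN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the memslot walk's range descriptor. */
struct demo_range {
	void (*on_lock)(void *ctx);	/* callback run once the lock is taken */
	bool lockless;			/* walk without taking the lock */
};

/* No-op callback, used only to populate on_lock in examples. */
static void demo_noop_on_lock(void *ctx)
{
	(void)ctx;
}

/*
 * A lockless walk never takes the lock, so an on_lock callback would
 * never run; asking for both is treated as a caller bug.
 */
static bool demo_range_is_valid(const struct demo_range *range)
{
	return !(range->lockless && range->on_lock != NULL);
}
```

A range with only lockless set, or only on_lock set, passes the check;
a range with both set is rejected.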

For the changes since v2[2], see v3.

Based on latest kvm-x86/next.

[1]: https://lore.kernel.org/kvm/20241009154953.1073471-1-seanjc@google.com/
[2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
[8]: https://lore.kernel.org/linux-mm/20241019012940.3656292-1-jthoughton@google.com/
[9]: https://lore.kernel.org/linux-mm/20241021160212.9935-1-jthoughton@google.com/

James Houghton (7):
  KVM: Remove kvm_handle_hva_range helper functions
  KVM: Add lockless memslot walk to KVM
  KVM: x86/mmu: Factor out spte atomic bit clearing routine
  KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  KVM: x86/mmu: Only check gfn age in shadow MMU if
    indirect_shadow_pages > 0
  KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Sean Christopherson (4):
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns

 arch/x86/include/asm/kvm_host.h               |   4 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        | 338 ++++++++++-----
 arch/x86/kvm/mmu/tdp_iter.h                   |  27 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  23 +-
 include/linux/kvm_host.h                      |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 366 ++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 virt/kvm/Kconfig                              |   2 +
 virt/kvm/kvm_main.c                           | 102 +++--
 12 files changed, 1124 insertions(+), 187 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: a27e0515592ec9ca28e0d027f42568c47b314784

Comments

Yu Zhao Nov. 5, 2024, 7:21 p.m. UTC | #1

On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> us to reduce the time/CPU it takes to do aging and the performance
> impact on the vCPUs while we are aging.
>
> The final patch in this series modifies access_tracking_stress_test to
> age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> are faulting memory in. Here are some results with that mode:

Additional background in case I didn't provide it before:

At Google we keep track of hotness/coldness of VM memory to identify
opportunities to demote cold memory into slower tiers of storage. This
is done in a controlled manner so that we benefit from the improved
memory efficiency through improved bin-packing without violating
customer SLOs.

However, the monitoring/tracking introduced two major overheads [1] for us:
1. the traditional (host) PFN + rmap data structures [2] used to
locate host PTEs (containing the accessed bits).
2. the KVM MMU lock required to clear the accessed bits in
secondary/shadow PTEs.

MGLRU provides the infrastructure for us to reach out into page tables
directly from a list of mm_struct's, and therefore allows us to bypass
the first problem above and reduce the CPU overhead by ~80% for our
workloads (90%+ mmaped memory). This series solves the second problem:
by supporting locklessly clearing the accessed bits in SPTEs, it would
reduce our current KVM MMU lock contention by >80% [3]. All other
existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
seamlessly benefit from this series when monitoring/tracking VM
memory.

[1] https://lwn.net/Articles/787611/
[2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
[3] https://research.google/pubs/profiling-a-warehouse-scale-computer/

Yu Zhao Nov. 5, 2024, 7:28 p.m. UTC | #2

On Tue, Nov 5, 2024 at 12:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
> >
> > Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> > Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> > us to reduce the time/CPU it takes to do aging and the performance
> > impact on the vCPUs while we are aging.
> >
> > The final patch in this series modifies access_tracking_stress_test to
> > age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> > are faulting memory in. Here are some results with that mode:
>
> Additional background in case I didn't provide it before:
>
> At Google we keep track of hotness/coldness of VM memory to identify
> opportunities to demote cold memory into slower tiers of storage. This
> is done in a controlled manner so that while we benefit from the
> improved memory efficiency through improved bin-packing, without
> violating customer SLOs.
>
> However, the monitoring/tracking introduced two major overheads [1] for us:
> 1. the traditional (host) PFN + rmap data structures [2] used to
> locate host PTEs (containing the accessed bits).
> 2. the KVM MMU lock required to clear the accessed bits in
> secondary/shadow PTEs.
>
> MGLRU provides the infrastructure for us to reach out into page tables
> directly from a list of mm_struct's, and therefore allows us to bypass
> the first problem above and reduce the CPU overhead by ~80% for our
> workloads (90%+ mmaped memory). This series solves the second problem:
> by supporting locklessly clearing the accessed bits in SPTEs, it would
> reduce our current KVM MMU lock contention by >80% [3]. All other
> existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
> seamlessly benefit from this series when monitoring/tracking VM
> memory.
>
> [1] https://lwn.net/Articles/787611/
> [2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
> [3] https://research.google/pubs/profiling-a-warehouse-scale-computer/

And we also ran an A/B experiment on a quarter million Chromebooks
running Android in VMs last year (with an older version of this
series):

It reduced PSI by 10% at the 99th percentile and "janks" by 8% at the
95th percentile, which resulted in an overall improvement in user
engagement by 16% at the 75th percentile (all statistically
significant).