
[v6,03/11] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Message ID 20240724011037.3671523-4-jthoughton@google.com (mailing list archive)
State New, archived
Series mm: multi-gen LRU: Walk secondary MMU page tables while aging

Commit Message

James Houghton July 24, 2024, 1:10 a.m. UTC
Replace the MMU write locks (taken in the memslot iteration loop) with
read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/Kconfig       |  1 +
 arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
 arch/arm64/kvm/mmu.c         | 30 ++++++++++++++++++++++--------
 3 files changed, 32 insertions(+), 14 deletions(-)
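
Why is racing with itself safe? In essence, the walker clears the access
flag with a single compare-and-exchange and, if it loses the race, simply
reports the page as young. A minimal standalone sketch of that "lose the
race, report young" pattern, assuming an illustrative PTE_AF bit and a
made-up test_clear_young() helper (not the actual kvm_pgtable code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_AF	(1ULL << 10)	/* illustrative AF bit position only */

/* Clear the AF without holding any lock; a lost race is reported as young. */
static bool test_clear_young(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);

	if (!(old & PTE_AF))
		return false;	/* already old */

	if (!atomic_compare_exchange_strong(ptep, &old, old & ~PTE_AF))
		return true;	/* PTE changed under us: most likely still young */

	return true;		/* we cleared the AF; it was young */
}

int main(void)
{
	_Atomic uint64_t pte = PTE_AF | 0xdeadb000;

	printf("first pass : young=%d\n", test_clear_young(&pte));	/* 1 */
	printf("second pass: young=%d\n", test_clear_young(&pte));	/* 0 */
	return 0;
}

A lost compare-and-exchange means someone else just updated the PTE, so
treating the GFN as young is the conservative answer; the comment this
patch adds to stage2_age_walker() makes the same argument.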

Comments

James Houghton July 25, 2024, 9:55 p.m. UTC | #1
On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
>
> Replace the MMU write locks (taken in the memslot iteration loop) for
> read locks.
>
> Grabbing the read lock instead of the write lock is safe because the
> only requirement we have is that the stage-2 page tables do not get
> deallocated while we are walking them. The stage2_age_walker() callback
> is safe to race with itself; update the comment to reflect the
> synchronization change.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---

Here is some data to show that this patch at least *can* be helpful:

# arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
# The test is faulting memory in while doing aging as fast as possible.
# taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory -p -v 32 -m 3

# Write lock
vcpu wall time                : 3.039207157s
lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)

# Read lock
vcpu wall time                : 3.010848445s
lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)

Aging is able to run significantly faster, but vCPU runtime isn't
affected much (in this test).

It would be really nice to motivate this patch with a test that didn't
require patching the kernel... Oliver and Marc, please let me know if
you'd like to see more data. I'm also happy to simply drop this patch.
Sean Christopherson Aug. 17, 2024, 12:46 a.m. UTC | #2
On Thu, Jul 25, 2024, James Houghton wrote:
> On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Replace the MMU write locks (taken in the memslot iteration loop) for
> > read locks.
> >
> > Grabbing the read lock instead of the write lock is safe because the
> > only requirement we have is that the stage-2 page tables do not get
> > deallocated while we are walking them. The stage2_age_walker() callback
> > is safe to race with itself; update the comment to reflect the
> > synchronization change.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> 
> Here is some data to show that this patch at least *can* be helpful:
> 
> # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> # The test is faulting memory in while doing aging as fast as possible.
> # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> -p -v 32 -m 3
> 
> # Write lock
> vcpu wall time                : 3.039207157s
> lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)
> 
> # Read lock
> vcpu wall time                : 3.010848445s
> lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)
> 
> Aging is able to run significantly faster, but vCPU runtime isn't
> affected much (in this test).

Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
be due to KVM arm64 taking mmap_lock for read when handling faults:

https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
Yu Zhao Aug. 17, 2024, 1:03 a.m. UTC | #3
On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jul 25, 2024, James Houghton wrote:
> > On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > Replace the MMU write locks (taken in the memslot iteration loop) for
> > > read locks.
> > >
> > > Grabbing the read lock instead of the write lock is safe because the
> > > only requirement we have is that the stage-2 page tables do not get
> > > deallocated while we are walking them. The stage2_age_walker() callback
> > > is safe to race with itself; update the comment to reflect the
> > > synchronization change.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> >
> > Here is some data to show that this patch at least *can* be helpful:
> >
> > # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> > # The test is faulting memory in while doing aging as fast as possible.
> > # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> > -p -v 32 -m 3
> >
> > # Write lock
> > vcpu wall time                : 3.039207157s
> > lru_gen avg pass duration     : 1.660541541s, (passes:2, total:3.321083083s)
> >
> > # Read lock
> > vcpu wall time                : 3.010848445s
> > lru_gen avg pass duration     : 0.306623698s, (passes:11, total:3.372860688s)
> >
> > Aging is able to run significantly faster, but vCPU runtime isn't
> > affected much (in this test).
>
> Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> be due to KVM arm64 taking mmap_lock for read when handling faults:
>
> https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com

For the above test, I don't think it's mmap_lock -- the reclaim path,
e.g., when zswapping guest memory, has two stages: aging (scanning
PTEs) and eviction (unmapping PTEs). Only testing the former isn't
realistic at all. IOW, for an r/w lock use case, only testing the
read-lock path would be bad coverage.
Oliver Upton Aug. 19, 2024, 8:41 p.m. UTC | #4
On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:

[...]

> > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > be due to KVM arm64 taking mmap_lock for read when handling faults:
> >
> > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> 
> For the above test, I don't think it's mmap_lock

Yeah, I don't think this is related to the mmap_lock.

James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
fault for an Access flag update. Even if he's on a machine w/o it,
Access flag faults are handled outside the mmap_lock.

Forcing SW management of the AF at stage-2 would be the best case for
demonstrating the locking improvement:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a24a2a857456..a640e8a8c6ea 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	 * happen to be running on a design that has unadvertised support for
 	 * HAFDBS. Here be dragons.
 	 */
-	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
-		vtcr |= VTCR_EL2_HA;
 #endif /* CONFIG_ARM64_HW_AFDBM */
 
 	if (kvm_lpa2_is_enabled())

Changing the config option would work too, but I wasn't sure if
FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

> -- the reclaim path,
> e.g., when zswapping guest memory, has two stages: aging (scanning
> PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> realistic at all.

AIUI, the intention of this test data is to provide some justification
for why Marc + I should consider the locking change *outside* of any
MMU notifier changes. So from that POV, this is meant as a hacked
up microbenchmark and not meant to be realistic.

And really, the arm64 change has nothing to do with this series at
this point, which is disappointing. In the interest of moving this
feature along for both architectures, would you be able to help James
with:

 - Identifying a benchmark that you believe is realistic

 - Suggestions on how to run that benchmark on Google infrastructure

Asking since you had a setup / data earlier on when you were carrying
the series. Hopefully with supportive data we can get arm64 to opt-in
to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
Sean Christopherson Aug. 19, 2024, 10:47 p.m. UTC | #5
On Mon, Aug 19, 2024, Oliver Upton wrote:
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> [...]
> 
> > > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:
> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> > 
> > For the above test, I don't think it's mmap_lock
> 
> Yeah, I don't think this is related to the mmap_lock.
> 
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update.

Huh, didn't know that was a thing on ARM.  Ooh, that lends even more credence to
my assertion that marking folios accessed in handle_access_fault() can go away[*].
I assume hardware-assisted updates mean this code in handle_access_fault() will
no longer be hit, as KVM simply won't ever get access faults?  If so, I'll add
that info to the changelog.

	if (kvm_pte_valid(pte))
		kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));

[*] https://lore.kernel.org/all/20240726235234.228822-83-seanjc@google.com

> Even if he's on a machine w/o it, Access flag faults are handled outside the
> mmap_lock.

Oh, right, they go down handle_access_fault(), not user_mem_abort().

Reviewing late Friday afternoon, never a good idea ;-)
James Houghton Aug. 30, 2024, 12:33 a.m. UTC | #6
On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> > > Were you expecting vCPU runtime to improve (more)?  If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:

I had no real expectation. I was hoping there might be a vCPU runtime
improvement, given that user_mem_abort() (called constantly because
this test is continuously faulting memory in) has to take the KVM MMU
lock for reading, while aging takes it for reading with this patch
vs. for writing without it. I think that's why aging is so much slower
with the write lock: it has to wait for the readers to drop the lock,
whereas the delay on the *readers* from a pending writer seems to be
pretty minimal.
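
(As a rough userspace illustration of that reader/writer dynamic, the
sketch below uses a pthread rwlock -- not the kernel's queued rwlock, so
fairness details differ, and the thread count, iteration count, and sleep
values are arbitrary. "Reader" threads stand in for the vCPU fault path;
the main thread stands in for an aging pass taking the lock shared
vs. exclusive.)

/* build: cc -O2 -pthread rwlock_demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool stop;

/* Stand-in for the vCPU fault path: briefly hold the lock for read, repeatedly. */
static void *fault_path(void *arg)
{
	struct timespec gap = { .tv_nsec = 1000 };	/* 1us outside the lock */

	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_rwlock_rdlock(&mmu_lock);
		/* pretend to install a stage-2 mapping */
		pthread_rwlock_unlock(&mmu_lock);
		nanosleep(&gap, NULL);
	}
	return NULL;
}

/* Stand-in for one aging pass: take the lock shared or exclusive per range. */
static double aging_pass(bool shared, int iters)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < iters; i++) {
		if (shared)
			pthread_rwlock_rdlock(&mmu_lock);
		else
			pthread_rwlock_wrlock(&mmu_lock);
		/* pretend to clear access flags for one range */
		pthread_rwlock_unlock(&mmu_lock);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	enum { NR_VCPUS = 8, ITERS = 100000 };
	pthread_t vcpus[NR_VCPUS];

	for (int i = 0; i < NR_VCPUS; i++)
		pthread_create(&vcpus[i], NULL, fault_path, NULL);

	printf("aging pass, write lock: %.3fs\n", aging_pass(false, ITERS));
	printf("aging pass, read lock : %.3fs\n", aging_pass(true, ITERS));

	atomic_store(&stop, true);
	for (int i = 0; i < NR_VCPUS; i++)
		pthread_join(vcpus[i], NULL);
	return 0;
}

With the reader threads hammering the lock, the exclusive pass has to
wait for them to drain on every acquisition, while the shared pass slots
in alongside them -- the same qualitative shape as the numbers above.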

> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> >
> > For the above test, I don't think it's mmap_lock
>
> Yeah, I don't think this is related to the mmap_lock.
>
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update. Even if he's on a machine w/o it,
> Access flag faults are handled outside the mmap_lock.

Yeah I was running on Ampere Altra CPUs.

> Forcing SW management of the AF at stage-2 would be the best case for
> demonstrating the locking improvement:
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a24a2a857456..a640e8a8c6ea 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>          * happen to be running on a design that has unadvertised support for
>          * HAFDBS. Here be dragons.
>          */
> -       if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
> -               vtcr |= VTCR_EL2_HA;
>  #endif /* CONFIG_ARM64_HW_AFDBM */
>
>         if (kvm_lpa2_is_enabled())

Thanks!

> Changing the config option would work too, but I wasn't sure if
> FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

Indeed, disabling CONFIG_ARM64_HW_AFDBM will cause MGLRU not to do aging.

> > -- the reclaim path,
> > e.g., when zswapping guest memory, has two stages: aging (scanning
> > PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> > realistic at all.
>
> AIUI, the intention of this test data is to provide some justification
> for why Marc + I should consider the locking change *outside* of any
> MMU notifier changes. So from that POV, this is meant as a hacked
> up microbenchmark and not meant to be realistic.
>
> And really, the arm64 change has nothing to do with this series at
> this point, which is disappointing. In the interest of moving this
> feature along for both architectures, would you be able help James
> with:
>
>  - Identifying a benchmark that you believe is realistic
>
>  - Suggestions on how to run that benchmark on Google infrastructure
>
> Asking since you had a setup / data earlier on when you were carrying
> the series. Hopefully with supportive data we can get arm64 to opt-in
> to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.

I'll keep trying other approaches to get testing similar to what Yu
had; it is somewhat difficult for me to reproduce those tests (and it
really shouldn't be... sorry).

I think it makes most sense for me to drop the arm64 patch for now and
re-propose it (or something stronger) alongside enabling aging. Does
that sound ok?
Oliver Upton Aug. 30, 2024, 12:48 a.m. UTC | #7
On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > Asking since you had a setup / data earlier on when you were carrying
> > the series. Hopefully with supportive data we can get arm64 to opt-in
> > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> 
> I'll keep trying some other approaches I can take for getting similar
> testing that Yu had; it is somewhat difficult for me to reproduce
> those tests (and it really shouldn't be.... sorry).

No need to apologize. Getting good test hardware for arm64 is a complete
chore. Sure would love a functional workstation with cores from this
decade...

> I think it makes most sense for me to drop the arm64 patch for now and
> re-propose it (or something stronger) alongside enabling aging. Does
> that sound ok?

I'm a bit disappointed that we haven't made forward progress on the
arm64 patches, but I also recognize this is the direction of travel as
the x86 patches are shaping up.

So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
soon while the context is still fresh.
David Matlack Aug. 30, 2024, 3:33 p.m. UTC | #8
On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > Asking since you had a setup / data earlier on when you were carrying
> > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> >
> > I'll keep trying some other approaches I can take for getting similar
> > testing that Yu had; it is somewhat difficult for me to reproduce
> > those tests (and it really shouldn't be.... sorry).
>
> No need to apologize. Getting good test hardware for arm64 is a complete
> chore. Sure would love a functional workstation with cores from this
> decade...
>
> > I think it makes most sense for me to drop the arm64 patch for now and
> > re-propose it (or something stronger) alongside enabling aging. Does
> > that sound ok?
>
> I'm a bit disappointed that we haven't gotten forward progress on the
> arm64 patches, but I also recognize this is the direction of travel as
> the x86 patches are shaping up.
>
> So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> soon while the context is still fresh.

Converting the aging notifiers to holding mmu_lock for read seems like
a pure win and minimal churn. Can we keep that patch in v7 (which
depends on the lockless notifier refactor, i.e. is not completely
stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
series.
Oliver Upton Aug. 30, 2024, 5:38 p.m. UTC | #9
Hey David,

On Fri, Aug 30, 2024 at 08:33:59AM -0700, David Matlack wrote:
> On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > Asking since you had a setup / data earlier on when you were carrying
> > > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> > >
> > > I'll keep trying some other approaches I can take for getting similar
> > > testing that Yu had; it is somewhat difficult for me to reproduce
> > > those tests (and it really shouldn't be.... sorry).
> >
> > No need to apologize. Getting good test hardware for arm64 is a complete
> > chore. Sure would love a functional workstation with cores from this
> > decade...
> >
> > > I think it makes most sense for me to drop the arm64 patch for now and
> > > re-propose it (or something stronger) alongside enabling aging. Does
> > > that sound ok?
> >
> > I'm a bit disappointed that we haven't gotten forward progress on the
> > arm64 patches, but I also recognize this is the direction of travel as
> > the x86 patches are shaping up.
> >
> > So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> > soon while the context is still fresh.
> 
> Converting the aging notifiers to holding mmu_lock for read seems like
> a pure win and minimal churn. Can we keep that patch in v7 (which
> depends on the lockless notifier refactor, i.e. is not completely
> stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
> series.

Even though the churn is minimal in LOC, locking changes are significant. If
one thing has become clear, it's that there are strong opinions about arm64
participating in MGLRU w/ the read lock. So it is almost guaranteed that
these read-lock changes will eventually get thrown out in favor of an
RCU-protected walker.

Then we're stuck with potentially 3 flavors of locking in kernels that
people actually use, and dealing with breakage that only affects that
intermediate step is gonna be annoying.

Patch

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..7a1af8141c0e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,6 +22,7 @@  menuconfig KVM
 	select KVM_COMMON
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_MMU_NOTIFIER
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..a24a2a857456 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,10 @@  static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	data->young = true;
 
 	/*
-	 * stage2_age_walker() is always called while holding the MMU lock for
-	 * write, so this will always succeed. Nonetheless, this deliberately
-	 * follows the race detection pattern of the other stage-2 walkers in
-	 * case the locking mechanics of the MMU notifiers is ever changed.
+	 * This walk is not exclusive; the PTE is permitted to change from
+	 * under us. If there is a race to update this PTE, then the GFN is
+	 * most likely young, so failing to clear the AF is likely to be
+	 * inconsequential.
 	 */
 	if (data->mkold && !stage2_try_set_pte(ctx, new))
 		return -EAGAIN;
@@ -1345,10 +1345,13 @@  bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_age_walker,
 		.arg		= &data,
-		.flags		= KVM_PGTABLE_WALK_LEAF,
+		.flags		= KVM_PGTABLE_WALK_LEAF |
+				  KVM_PGTABLE_WALK_SHARED,
 	};
+	int r;
 
-	WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+	r = kvm_pgtable_walk(pgt, addr, size, &walker);
+	WARN_ON_ONCE(r && r != -EAGAIN);
 	return data.young;
 }
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..e37765f6f2a1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1912,29 +1912,43 @@  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, true);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, true);
 	/*
 	 * TODO: Handle nested_mmu structures here using the reverse mapping in
 	 * a later version of patch series.
 	 */
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, false);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, false);
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)