
[15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak

Message ID 20210112181041.356734-16-bgardon@google.com
State New, archived
Series Allow parallel page faults with TDP MMU

Commit Message

Ben Gardon Jan. 12, 2021, 6:10 p.m. UTC
Wrap the MMU lock cond_resched and needbreak operations in a function.
This will support a refactoring to move the lock into the struct
kvm_arch(s) so that x86 can change the spinlock to a rwlock without
affecting the performance of other archs.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/arm64/kvm/mmu.c       |  2 +-
 arch/x86/kvm/mmu/mmu.c     | 16 ++++++++--------
 arch/x86/kvm/mmu/tdp_mmu.c |  8 ++++----
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        | 10 ++++++++++
 5 files changed, 25 insertions(+), 13 deletions(-)

Comments

Sean Christopherson Jan. 21, 2021, 12:19 a.m. UTC | #1
On Tue, Jan 12, 2021, Ben Gardon wrote:
> Wrap the MMU lock cond_resched and needbreak operations in a function.
> This will support a refactoring to move the lock into the struct
> kvm_arch(s) so that x86 can change the spinlock to a rwlock without
> affecting the performance of other archs.

IMO, moving the lock to arch-specific code is bad for KVM.  The architectures'
MMUs already diverge pretty horribly, and once things diverge it's really hard
to go the other direction.  And this change, along with all of the wrappers,
thrashes a lot of code and adds a fair amount of indirection without any real
benefit to the other architectures.

What if we simply make the common mmu_lock a union?  The rwlock_t is probably a
bit bigger, but that's a few bytes for an entire VM.  And maybe this would
entice/inspire other architectures to move to a similar MMU model.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..bbc8efd4af62 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,7 +451,10 @@ struct kvm_memslots {
 };

 struct kvm {
-       spinlock_t mmu_lock;
+       union {
+               rwlock_t mmu_rwlock;
+               spinlock_t mmu_lock;
+       };
        struct mutex slots_lock;
        struct mm_struct *mm; /* userspace tied to this vm */
        struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
Paolo Bonzini Jan. 21, 2021, 8:17 p.m. UTC | #2
On 21/01/21 01:19, Sean Christopherson wrote:
> IMO, moving the lock to arch-specific code is bad for KVM. The 
> architectures' MMUs already diverge pretty horribly, and once things 
> diverge it's really hard to go the other direction. And this change,
> along with all of the wrappers, thrashes a lot of code and adds a fair
> amount of indirection without any real benefit to the other
> architectures. What if we simply make the common mmu_lock a union? The 
> rwlock_t is probably a bit bigger, but that's a few bytes for an entire 
> VM. And maybe this would entice/inspire other architectures to move to a 
> similar MMU model.
I agree.  Most architectures don't do the lockless tricks that x86 does,
and being able to lock for read would be better than nothing.  For
example, I took a look at ARM and stage2_update_leaf_attrs could be
changed to operate in cmpxchg-like style while holding the rwlock for read.
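
A minimal sketch of that idea, taking the lock for read and doing the
attribute update with cmpxchg; stage2_get_leaf_ptep() and the mmu_rwlock
name below are illustrative placeholders, not the actual arm64 code:

/*
 * Sketch only: update a stage-2 leaf PTE's attributes with cmpxchg while
 * holding mmu_lock for read, retrying if another walker races with us.
 */
static void stage2_update_leaf_attrs_sketch(struct kvm *kvm, phys_addr_t addr,
					    kvm_pte_t set, kvm_pte_t clr)
{
	kvm_pte_t *ptep, old, new;

	read_lock(&kvm->mmu_rwlock);

	ptep = stage2_get_leaf_ptep(kvm, addr);	/* hypothetical walker */
	if (!ptep)
		goto out;

	do {
		old = READ_ONCE(*ptep);
		new = (old & ~clr) | set;
	} while (cmpxchg64(ptep, old, new) != old);

out:
	read_unlock(&kvm->mmu_rwlock);
}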

Paolo

> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f3b1013fb22c..bbc8efd4af62 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -451,7 +451,10 @@ struct kvm_memslots {
>  };
> 
>  struct kvm {
> -       spinlock_t mmu_lock;
> +       union {
> +               rwlock_t mmu_rwlock;
> +               spinlock_t mmu_lock;
> +       };
>         struct mutex slots_lock;
>         struct mm_struct *mm; /* userspace tied to this vm */
>         struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
Paolo Bonzini Jan. 26, 2021, 2:38 p.m. UTC | #3
On 21/01/21 01:19, Sean Christopherson wrote:
> What if we simply make the common mmu_lock a union? The rwlock_t is 
> probably a bit bigger, but that's a few bytes for an entire VM. And 
> maybe this would entice/inspire other architectures to move to a similar 
> MMU model.

Looking more at this, there is a problem in that MMU notifier functions 
take the MMU lock.

Yes, a qrwlock is a bit larger than a qspinlock.  However, the fast path
of qrwlocks is small, and the slow paths are tiny compared to the
mmu_lock critical sections, which are so big as to require cond_resched.
So I would consider just changing all architectures to an rwlock.

Paolo
Ben Gardon Jan. 26, 2021, 5:47 p.m. UTC | #4
On Tue, Jan 26, 2021 at 6:38 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/01/21 01:19, Sean Christopherson wrote:
> > What if we simply make the common mmu_lock a union? The rwlock_t is
> > probably a bit bigger, but that's a few bytes for an entire VM. And
> > maybe this would entice/inspire other architectures to move to a similar
> > MMU model.
>
> Looking more at this, there is a problem in that MMU notifier functions
> take the MMU lock.
>
> Yes, a qrwlock is a bit larger than a qspinlock.  However, the fast
> path of qrwlocks is small, and the slow paths are tiny compared to the
> mmu_lock critical sections, which are so big as to require
> cond_resched.  So I would consider just changing all architectures to
> an rwlock.

I like the idea of putting the MMU lock union directly in struct kvm
and will make that change in the next version of this series. In my
testing, I found that wholesale replacing the spin lock with an rwlock
had a noticeable negative performance impact on the legacy / shadow
MMU. Enough that it motivated me to implement this more complex union
scheme. While the difference was pronounced in the dirty log perf test
microbenchmark, it's an open question as to whether it would matter in
practice.

>
> Paolo
>
Paolo Bonzini Jan. 26, 2021, 5:55 p.m. UTC | #5
On 26/01/21 18:47, Ben Gardon wrote:
> Enough that it motivated me to implement this more complex union
> scheme. While the difference was pronounced in the dirty log perf test
> microbenchmark, it's an open question as to whether it would matter in
> practice.

I'll look at getting some numbers if it's just the dirty log perf test.
Did you see anything in the profile pointing specifically at rwlock?

Paolo
Ben Gardon Jan. 26, 2021, 6:11 p.m. UTC | #6
On Tue, Jan 26, 2021 at 9:55 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/01/21 18:47, Ben Gardon wrote:
> > Enough that it motivated me to implement this more complex union
> > scheme. While the difference was pronounced in the dirty log perf test
> > microbenchmark, it's an open question as to whether it would matter in
> > practice.
>
> I'll look at getting some numbers if it's just the dirty log perf test.
>   Did you see anything in the profile pointing specifically at rwlock?

When I did a strict replacement I found ~10% worse memory population
performance.
Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
disabled, I got 119 sec to populate memory as the baseline and 134 sec
with an earlier version of this series which just replaced the
spinlock with an rwlock. I believe this difference is statistically
significant, but didn't run multiple trials.
I didn't take notes when profiling, but I'm pretty sure the rwlock
slowpath showed up a lot. This was a very high contention scenario, so
it's probably not indicative of real-world performance.
In the slow path, the rwlock is certainly slower than a spin lock.

If the real impact doesn't seem too large, I'd be very happy to just
replace the spinlock.

>
> Paolo
>
Paolo Bonzini Jan. 26, 2021, 8:47 p.m. UTC | #7
On 26/01/21 19:11, Ben Gardon wrote:
> When I did a strict replacement I found ~10% worse memory population
> performance.
> Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
> disabled, I got 119 sec to populate memory as the baseline and 134 sec
> with an earlier version of this series which just replaced the
> spinlock with an rwlock. I believe this difference is statistically
> significant, but didn't run multiple trials.
> I didn't take notes when profiling, but I'm pretty sure the rwlock
> slowpath showed up a lot. This was a very high contention scenario, so
> it's probably not indicative of real-world performance.
> In the slow path, the rwlock is certainly slower than a spin lock.
> 
> If the real impact doesn't seem too large, I'd be very happy to just
> replace the spinlock.

Ok, so let's use the union idea and add a "#define KVM_HAVE_MMU_RWLOCK"
to x86.  The virt/kvm/kvm_main.c MMU notifier functions can use the
#define to pick between write_lock and spin_lock.

For x86 I want to switch to tdp_mmu=1 by default as soon as parallel 
page faults are in, so we can use the rwlock unconditionally and drop 
the wrappers, except possibly for some kind of kvm_mmu_lock/unlock_root 
that choose between read_lock for TDP MMU and write_lock for shadow MMU.
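
A rough sketch of what such a root-lock helper might look like, assuming
mmu_lock has become an rwlock_t on x86 (this is not code from the series):

static void kvm_mmu_lock_root(struct kvm *kvm, bool is_tdp_mmu)
{
	/* The TDP MMU tolerates concurrent faults; the shadow MMU does not. */
	if (is_tdp_mmu)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);
}

static void kvm_mmu_unlock_root(struct kvm *kvm, bool is_tdp_mmu)
{
	if (is_tdp_mmu)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);
}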

Thanks!

Paolo
Ben Gardon Jan. 27, 2021, 8:08 p.m. UTC | #8
On Tue, Jan 26, 2021 at 12:48 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/01/21 19:11, Ben Gardon wrote:
> > When I did a strict replacement I found ~10% worse memory population
> > performance.
> > Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
> > disabled, I got 119 sec to populate memory as the baseline and 134 sec
> > with an earlier version of this series which just replaced the
> > spinlock with an rwlock. I believe this difference is statistically
> > significant, but didn't run multiple trials.
> > I didn't take notes when profiling, but I'm pretty sure the rwlock
> > slowpath showed up a lot. This was a very high contention scenario, so
> > it's probably not indicative of real-world performance.
> > In the slow path, the rwlock is certainly slower than a spin lock.
> >
> > If the real impact doesn't seem too large, I'd be very happy to just
> > replace the spinlock.
>
> Ok, so let's use the union idea and add a "#define KVM_HAVE_MMU_RWLOCK"
> to x86.  The virt/kvm/kvm_main.c MMU notifier functions can use the
> #define to pick between write_lock and spin_lock.

I'm not entirely sure I understand this suggestion. Are you suggesting
we'd have the spinlock and rwlock in a union in struct kvm but then
use a static define to choose which one is used by other functions? It
seems like if we're using static defines the union doesn't add value.
If we do use the union, I think the advantages offered by __weak
wrapper functions, overridden on a per-arch basis, are worthwhile.
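
For illustration, a sketch of the __weak approach (not code from this
series; the file placement, field names, and override are assumptions):

/* virt/kvm/kvm_main.c: default wrapper used by most architectures. */
void __weak kvm_mmu_lock(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
}

/* x86 override, using the rwlock member of the union instead. */
void kvm_mmu_lock(struct kvm *kvm)
{
	write_lock(&kvm->mmu_rwlock);
}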

>
> For x86 I want to switch to tdp_mmu=1 by default as soon as parallel
> page faults are in, so we can use the rwlock unconditionally and drop
> the wrappers, except possibly for some kind of kvm_mmu_lock/unlock_root
> that choose between read_lock for TDP MMU and write_lock for shadow MMU.
>
> Thanks!
>
> Paolo
>
Paolo Bonzini Jan. 27, 2021, 8:55 p.m. UTC | #9
On 27/01/21 21:08, Ben Gardon wrote:
> I'm not entirely sure I understand this suggestion. Are you suggesting
> we'd have the spinlock and rwlock in a union in struct kvm but then
> use a static define to choose which one is used by other functions? It
> seems like if we're using static defines the union doesn't add value.

Of course you're right.  You'd just place the #ifdef in the struct kvm 
definition.

You can place static inline functions for lock/unlock in 
virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.
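
Something along these lines, as a sketch (the KVM_HAVE_MMU_RWLOCK name
follows the earlier suggestion in this thread; nothing here is merged code):

/* virt/kvm/mmu_lock.h */
#ifndef KVM_MMU_LOCK_H
#define KVM_MMU_LOCK_H 1

#include <linux/kvm_host.h>

#ifdef KVM_HAVE_MMU_RWLOCK
static inline void kvm_mmu_lock(struct kvm *kvm)
{
	write_lock(&kvm->mmu_lock);
}

static inline void kvm_mmu_unlock(struct kvm *kvm)
{
	write_unlock(&kvm->mmu_lock);
}
#else
static inline void kvm_mmu_lock(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
}

static inline void kvm_mmu_unlock(struct kvm *kvm)
{
	spin_unlock(&kvm->mmu_lock);
}
#endif /* KVM_HAVE_MMU_RWLOCK */

#endif /* KVM_MMU_LOCK_H */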

Paolo
Ben Gardon Jan. 27, 2021, 9:20 p.m. UTC | #10
On Wed, Jan 27, 2021 at 12:55 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 27/01/21 21:08, Ben Gardon wrote:
> > I'm not entirely sure I understand this suggestion. Are you suggesting
> > we'd have the spinlock and rwlock in a union in struct kvm but then
> > use a static define to choose which one is used by other functions? It
> > seems like if we're using static defines the union doesn't add value.
>
> Of course you're right.  You'd just place the #ifdef in the struct kvm
> definition.

Ah okay, thanks for clarifying.

>
> You can place static inline functions for lock/unlock in
> virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.

Would you prefer to make that change in this series or at a later
date? I'm assuming this would replace all the wrapper functions and
mean that x86 is rwlock only.

>
> Paolo
>
Paolo Bonzini Jan. 28, 2021, 8:18 a.m. UTC | #11
On 27/01/21 22:20, Ben Gardon wrote:
> On Wed, Jan 27, 2021 at 12:55 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 27/01/21 21:08, Ben Gardon wrote:
>>> I'm not entirely sure I understand this suggestion. Are you suggesting
>>> we'd have the spinlock and rwlock in a union in struct kvm but then
>>> use a static define to choose which one is used by other functions? It
>>> seems like if we're using static defines the union doesn't add value.
>>
>> Of course you're right.  You'd just place the #ifdef in the struct kvm
>> definition.
> 
> Ah okay, thanks for clarifying.
> 
>>
>> You can place static inline functions for lock/unlock in
>> virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.
> 
> Would you prefer to make that change in this series or at a later
> date? I'm assuming this would replace all the wrapper functions and
> mean that x86 is rwlock only.

Yes, exactly.  I would like to make tdp_mmu=1 the default as soon as 
parallel page faults are in (and thus scalability should be on par with 
the shadow MMU).

Paolo

Patch

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 402b1642c944..57ef1ec23b56 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -58,7 +58,7 @@  static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 			break;
 
 		if (resched && next != end)
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 	} while (addr = next, addr != end);
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5a4577830606..659ed0a2875f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2016,9 +2016,9 @@  static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(vcpu->kvm)) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
-			cond_resched_lock(&vcpu->kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(vcpu->kvm);
 			flush = false;
 		}
 	}
@@ -5233,14 +5233,14 @@  slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		if (iterator.rmap)
 			flush |= fn(kvm, iterator.rmap);
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 			if (flush && lock_flush_tlb) {
 				kvm_flush_remote_tlbs_with_address(kvm,
 						start_gfn,
 						iterator.gfn - start_gfn + 1);
 				flush = false;
 			}
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 		}
 	}
 
@@ -5390,7 +5390,7 @@  static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		 * be in active use by the guest.
 		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		    cond_resched_lock(&kvm->mmu_lock)) {
+		    kvm_mmu_lock_cond_resched(kvm)) {
 			batch = 0;
 			goto restart;
 		}
@@ -5688,7 +5688,7 @@  void kvm_mmu_zap_all(struct kvm *kvm)
 			continue;
 		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
 			goto restart;
-		if (cond_resched_lock(&kvm->mmu_lock))
+		if (kvm_mmu_lock_cond_resched(kvm))
 			goto restart;
 	}
 
@@ -6013,9 +6013,9 @@  static void kvm_recover_nx_lpages(struct kvm *kvm)
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 		}
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 90807f2d928f..fb911ca428b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -488,10 +488,10 @@  static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,
 		struct tdp_iter *iter)
 {
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+	if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 		kvm_flush_remote_tlbs(kvm);
 		rcu_read_unlock();
-		cond_resched_lock(&kvm->mmu_lock);
+		kvm_mmu_lock_cond_resched(kvm);
 		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
@@ -512,9 +512,9 @@  static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,
  */
 static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+	if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 		rcu_read_unlock();
-		cond_resched_lock(&kvm->mmu_lock);
+		kvm_mmu_lock_cond_resched(kvm);
 		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 433d14fdae30..6e2773fc406c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1497,5 +1497,7 @@  static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_lock(struct kvm *kvm);
 void kvm_mmu_unlock(struct kvm *kvm);
+int kvm_mmu_lock_needbreak(struct kvm *kvm);
+int kvm_mmu_lock_cond_resched(struct kvm *kvm);
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 32f97ed1188d..b4c49a7e0556 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -442,6 +442,16 @@  void kvm_mmu_unlock(struct kvm *kvm)
 	spin_unlock(&kvm->mmu_lock);
 }
 
+int kvm_mmu_lock_needbreak(struct kvm *kvm)
+{
+	return spin_needbreak(&kvm->mmu_lock);
+}
+
+int kvm_mmu_lock_cond_resched(struct kvm *kvm)
+{
+	return cond_resched_lock(&kvm->mmu_lock);
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {