
KVM: Boost vCPU candidate in user mode which is delivering interrupt

Message ID 1618542490-14756-1-git-send-email-wanpengli@tencent.com (mailing list archive)
State New, archived
Series KVM: Boost vCPU candidate in user mode which is delivering interrupt

Commit Message

Wanpeng Li April 16, 2021, 3:08 a.m. UTC
From: Wanpeng Li <wanpengli@tencent.com>

Both the lock holder vCPU and a halted IPI receiver are candidates for
a boost. However, the PLE handler was originally designed to deal with
the lock holder preemption problem: Intel PLE occurs while the spinlock
waiter is in kernel mode. This assumption does not hold for IPI
receivers, which can be in either kernel or user mode, so a vCPU
candidate in user mode will not be boosted even though it should
respond to IPIs. Some benchmarks such as pbzip2 and swaptions do their
TLB shootdowns in kernel mode but run in user mode most of the time.
This can lead to a long run of back-to-back PLE events, because the IPI
sender keeps triggering PLE exits until the receiver is scheduled,
while the receiver is never a boost candidate.

This patch boosts a vCPU candidate in user mode to which an interrupt
is being delivered. We observe that pbzip2 speeds up by 10% in a
96-vCPU VM in an over-subscribed scenario (the host is a 2-socket,
48-core, 96-HT Intel CLX box). There is no performance regression for
other benchmarks such as Unixbench spawn (which mostly contends a
read/write lock in kernel mode) or ebizzy (which mostly contends a
read/write semaphore and does TLB shootdowns in kernel mode).

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/x86.c       | 8 ++++++++
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 6 ++++++
 3 files changed, 15 insertions(+)
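
For reference, a minimal sketch of where the new check lands in the directed-yield
loop, reconstructed from the kvm_main.c hunk at the bottom of this page (context
abridged, not verbatim kernel source):

	/* virt/kvm/kvm_main.c, inside kvm_vcpu_on_spin() -- abridged sketch */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		...
		/*
		 * Before this patch a preempted vCPU in user mode was always
		 * skipped when the yielder wanted a kernel-mode target; with
		 * the new check it stays a candidate if an APICv interrupt is
		 * waiting to be delivered to it.
		 */
		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
		    !kvm_arch_interrupt_delivery(vcpu) &&	/* new check */
		    !kvm_arch_vcpu_in_kernel(vcpu))
			continue;
		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
			continue;
		yielded = kvm_vcpu_yield_to(vcpu);
		...
	}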

Comments

Paolo Bonzini April 17, 2021, 1:09 p.m. UTC | #1
On 16/04/21 05:08, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> Both the lock holder vCPU and a halted IPI receiver are candidates for
> a boost. However, the PLE handler was originally designed to deal with
> the lock holder preemption problem: Intel PLE occurs while the spinlock
> waiter is in kernel mode. This assumption does not hold for IPI
> receivers, which can be in either kernel or user mode, so a vCPU
> candidate in user mode will not be boosted even though it should
> respond to IPIs. Some benchmarks such as pbzip2 and swaptions do their
> TLB shootdowns in kernel mode but run in user mode most of the time.
> This can lead to a long run of back-to-back PLE events, because the IPI
> sender keeps triggering PLE exits until the receiver is scheduled,
> while the receiver is never a boost candidate.
> 
> This patch boosts a vCPU candidate in user mode to which an interrupt
> is being delivered. We observe that pbzip2 speeds up by 10% in a
> 96-vCPU VM in an over-subscribed scenario (the host is a 2-socket,
> 48-core, 96-HT Intel CLX box). There is no performance regression for
> other benchmarks such as Unixbench spawn (which mostly contends a
> read/write lock in kernel mode) or ebizzy (which mostly contends a
> read/write semaphore and does TLB shootdowns in kernel mode).
>   
> +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
> +		return true;
> +
> +	return false;
> +}

Can you reuse vcpu_dy_runnable instead of this new function?

Paolo



>   bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>   {
>   	return vcpu->arch.preempted_in_kernel;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3b06d12..5012fc4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -954,6 +954,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
>   bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
>   int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
>   bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
> +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu);
>   int kvm_arch_post_init_vm(struct kvm *kvm);
>   void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>   
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0a481e7..781d2db 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3012,6 +3012,11 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
>   	return false;
>   }
>   
> +bool __weak kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return false;
> +}
> +
>   void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>   {
>   	struct kvm *kvm = me->kvm;
> @@ -3045,6 +3050,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>   			    !vcpu_dy_runnable(vcpu))
>   				continue;
>   			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> +				!kvm_arch_interrupt_delivery(vcpu) &&
>   				!kvm_arch_vcpu_in_kernel(vcpu))
>   				continue;
>   			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>
Wanpeng Li April 19, 2021, 7:34 a.m. UTC | #2
On Sat, 17 Apr 2021 at 21:09, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 16/04/21 05:08, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > Both the lock holder vCPU and a halted IPI receiver are candidates for
> > a boost. However, the PLE handler was originally designed to deal with
> > the lock holder preemption problem: Intel PLE occurs while the spinlock
> > waiter is in kernel mode. This assumption does not hold for IPI
> > receivers, which can be in either kernel or user mode, so a vCPU
> > candidate in user mode will not be boosted even though it should
> > respond to IPIs. Some benchmarks such as pbzip2 and swaptions do their
> > TLB shootdowns in kernel mode but run in user mode most of the time.
> > This can lead to a long run of back-to-back PLE events, because the IPI
> > sender keeps triggering PLE exits until the receiver is scheduled,
> > while the receiver is never a boost candidate.
> >
> > This patch boosts a vCPU candidate in user mode to which an interrupt
> > is being delivered. We observe that pbzip2 speeds up by 10% in a
> > 96-vCPU VM in an over-subscribed scenario (the host is a 2-socket,
> > 48-core, 96-HT Intel CLX box). There is no performance regression for
> > other benchmarks such as Unixbench spawn (which mostly contends a
> > read/write lock in kernel mode) or ebizzy (which mostly contends a
> > read/write semaphore and does TLB shootdowns in kernel mode).
> >
> > +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
> > +{
> > +     if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
> > +             return true;
> > +
> > +     return false;
> > +}
>
> Can you reuse vcpu_dy_runnable instead of this new function?

I have some concerns. On x86, vcpu_dy_runnable() will add extra
vCPU candidates via KVM_REQ_EVENT and async page faults (which have
already opportunistically let the guest do other work). For other
architectures, kvm_arch_dy_runnable() is equal to kvm_arch_vcpu_runnable(),
except on powerpc, which checks too many events and is not conservative.
In general, vcpu_dy_runnable() loosens the conditions and adds more
vCPU candidates.
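
For reference, the two helpers being compared look roughly like this
(abridged sketch based on this discussion; exact details vary by kernel
version):

	/* virt/kvm/kvm_main.c -- generic wrapper */
	static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
	{
		if (kvm_arch_dy_runnable(vcpu))
			return true;

	#ifdef CONFIG_KVM_ASYNC_PF
		/* a completed async page fault also makes the vCPU a candidate */
		if (!list_empty_careful(&vcpu->async_pf.done))
			return true;
	#endif

		return false;
	}

	/* arch/x86/kvm/x86.c -- x86 arch hook */
	bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
	{
		if (READ_ONCE(vcpu->arch.pv.pv_unhalted))
			return true;

		if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
		    kvm_test_request(KVM_REQ_SMI, vcpu) ||
		    kvm_test_request(KVM_REQ_EVENT, vcpu))
			return true;

		if (vcpu->arch.apicv_active &&
		    static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
			return true;

		return false;
	}

The proposed kvm_arch_interrupt_delivery() keeps only the APICv
pending-interrupt test, which is why it is narrower than reusing
vcpu_dy_runnable() directly.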

    Wanpeng
Sean Christopherson April 19, 2021, 4:32 p.m. UTC | #3
On Mon, Apr 19, 2021, Wanpeng Li wrote:
> On Sat, 17 Apr 2021 at 21:09, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 16/04/21 05:08, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > Both the lock holder vCPU and a halted IPI receiver are candidates for
> > > a boost. However, the PLE handler was originally designed to deal with
> > > the lock holder preemption problem: Intel PLE occurs while the spinlock
> > > waiter is in kernel mode. This assumption does not hold for IPI
> > > receivers, which can be in either kernel or user mode, so a vCPU
> > > candidate in user mode will not be boosted even though it should
> > > respond to IPIs. Some benchmarks such as pbzip2 and swaptions do their
> > > TLB shootdowns in kernel mode but run in user mode most of the time.
> > > This can lead to a long run of back-to-back PLE events, because the IPI
> > > sender keeps triggering PLE exits until the receiver is scheduled,
> > > while the receiver is never a boost candidate.
> > >
> > > This patch boosts a vCPU candidate in user mode to which an interrupt
> > > is being delivered. We observe that pbzip2 speeds up by 10% in a
> > > 96-vCPU VM in an over-subscribed scenario (the host is a 2-socket,
> > > 48-core, 96-HT Intel CLX box). There is no performance regression for
> > > other benchmarks such as Unixbench spawn (which mostly contends a
> > > read/write lock in kernel mode) or ebizzy (which mostly contends a
> > > read/write semaphore and does TLB shootdowns in kernel mode).
> > >
> > > +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
> > > +{
> > > +     if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
> > > +             return true;
> > > +
> > > +     return false;
> > > +}
> >
> > Can you reuse vcpu_dy_runnable instead of this new function?
> 
> I have some concerns. On x86, vcpu_dy_runnable() will add extra
> vCPU candidates via KVM_REQ_EVENT

Is bringing in KVM_REQ_EVENT a bad thing though?  I don't see how using apicv is
special in this case.  apicv is more precise and so there will be fewer false
positives, but it's still just a guess on KVM's part since the interrupt could
be for something completely unrelated.

If false positives are a big concern, what about adding another pass to the loop
and only yielding to usermode vCPUs with interrupts in the second full pass?
I.e. give vCPUs that are already in kernel mode priority, and only yield to
handle an interrupt if there are no vCPUs in kernel mode.

kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.

> and async page faults (which have already opportunistically let the guest do other work).

Any reason not to use kvm_arch_dy_runnable() directly?

> For other architectures, kvm_arch_dy_runnable() is equal to kvm_arch_vcpu_runnable(),
> except on powerpc, which checks too many events and is not conservative. In general,
> vcpu_dy_runnable() loosens the conditions and adds more vCPU candidates.
> 
>     Wanpeng
Paolo Bonzini April 19, 2021, 4:59 p.m. UTC | #4
On 19/04/21 18:32, Sean Christopherson wrote:
> If false positives are a big concern, what about adding another pass to the loop
> and only yielding to usermode vCPUs with interrupts in the second full pass?
> I.e. give vCPUs that are already in kernel mode priority, and only yield to
> handle an interrupt if there are no vCPUs in kernel mode.
> 
> kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.

pv_unhalted won't help if you're waiting for a kernel spinlock though, 
would it?  Doing two passes (or looking for a "best" candidate that 
prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt) 
seems like the best choice overall.

Paolo
Wanpeng Li April 20, 2021, 6:02 a.m. UTC | #5
On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 19/04/21 18:32, Sean Christopherson wrote:
> > If false positives are a big concern, what about adding another pass to the loop
> > and only yielding to usermode vCPUs with interrupts in the second full pass?
> > I.e. give vCPUs that are already in kernel mode priority, and only yield to
> > handle an interrupt if there are no vCPUs in kernel mode.
> >
> > kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.
>
> pv_unhalted won't help if you're waiting for a kernel spinlock though,
> would it?  Doing two passes (or looking for a "best" candidate that
> prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
> seems like the best choice overall.

How about something like this:

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6b4dd95..8ba50be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -325,10 +325,12 @@ struct kvm_vcpu {
      * Cpu relax intercept or pause loop exit optimization
      * in_spin_loop: set when a vcpu does a pause loop exit
      *  or cpu relax intercepted.
+     * pending_interrupt: set when a vcpu waiting for an interrupt
      * dy_eligible: indicates whether vcpu is eligible for directed yield.
      */
     struct {
         bool in_spin_loop;
+        bool pending_interrupt;
         bool dy_eligible;
     } spin_loop;
 #endif
@@ -1427,6 +1429,12 @@ static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
 {
     vcpu->spin_loop.in_spin_loop = val;
 }
+
+static inline void kvm_vcpu_set_pending_interrupt(struct kvm_vcpu *vcpu, bool val)
+{
+    vcpu->spin_loop.pending__interrupt = val;
+}
+
 static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
     vcpu->spin_loop.dy_eligible = val;
@@ -1438,6 +1446,10 @@ static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
 {
 }

+static inline void kvm_vcpu_set_pending_interrupt(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
 static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 529cff1..42e0255 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -410,6 +410,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
     INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);

     kvm_vcpu_set_in_spin_loop(vcpu, false);
+    kvm_vcpu_set_pending_interrupt(vcpu, false);
     kvm_vcpu_set_dy_eligible(vcpu, false);
     vcpu->preempted = false;
     vcpu->ready = false;
@@ -3079,14 +3080,17 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
  * Helper that checks whether a VCPU is eligible for directed yield.
  * Most eligible candidate to yield is decided by following heuristics:
  *
- *  (a) VCPU which has not done pl-exit or cpu relax intercepted recently
- *  (preempted lock holder), indicated by @in_spin_loop.
- *  Set at the beginning and cleared at the end of interception/PLE handler.
+ *  (a) VCPU which has not done pl-exit or cpu relax intercepted and is not
+ *  waiting for an interrupt recently (preempted lock holder). The former
+ *  one is indicated by @in_spin_loop, set at the beginning and cleared at
+ *  the end of interception/PLE handler. The later one is indicated by
+ *  @pending_interrupt, set when interrupt is delivering and cleared at
+ *  the end of directed yield.
  *
- *  (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
- *  chance last time (mostly it has become eligible now since we have probably
- *  yielded to lockholder in last iteration. This is done by toggling
- *  @dy_eligible each time a VCPU checked for eligibility.)
+ *  (b) VCPU which has done pl-exit/ cpu relax intercepted or is waiting for
+ *  interrupt but did not get chance last time (mostly it has become eligible
+ *  now since we have probably yielded to lockholder in last iteration. This
+ *  is done by toggling @dy_eligible each time a VCPU checked for eligibility.)
  *
  *  Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
  *  to preempted lock-holder could result in wrong VCPU selection and CPU
@@ -3102,10 +3106,10 @@ static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
     bool eligible;

-    eligible = !vcpu->spin_loop.in_spin_loop ||
+    eligible = !(vcpu->spin_loop.in_spin_loop || vcpu->spin_loop.has_interrupt) ||
             vcpu->spin_loop.dy_eligible;

-    if (vcpu->spin_loop.in_spin_loop)
+    if (vcpu->spin_loop.in_spin_loop || vcpu->spin_loop.has_interrupt)
         kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);

     return eligible;
@@ -3137,6 +3141,16 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
     return false;
 }

+static bool kvm_has_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+    if (vcpu_dy_runnable(vcpu)) {
+        kvm_vcpu_set_pending_interrupt(vcpu, true);
+        return true;
+    }
+
+    return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
     struct kvm *kvm = me->kvm;
@@ -3170,6 +3184,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
                 !vcpu_dy_runnable(vcpu))
                 continue;
             if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+                !kvm_has_interrupt_delivery(vcpu) &&
                 !kvm_arch_vcpu_in_kernel(vcpu))
                 continue;
             if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
@@ -3177,6 +3192,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)

             yielded = kvm_vcpu_yield_to(vcpu);
             if (yielded > 0) {
+                kvm_vcpu_set_pending_interrupt(vcpu, false);
                 kvm->last_boosted_vcpu = i;
                 break;
             } else if (yielded < 0) {
Wanpeng Li April 20, 2021, 6:08 a.m. UTC | #6
On Tue, 20 Apr 2021 at 14:02, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 19/04/21 18:32, Sean Christopherson wrote:
> > > If false positives are a big concern, what about adding another pass to the loop
> > > and only yielding to usermode vCPUs with interrupts in the second full pass?
> > > I.e. give vCPUs that are already in kernel mode priority, and only yield to
> > > handle an interrupt if there are no vCPUs in kernel mode.
> > >
> > > kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.
> >
> > pv_unhalted won't help if you're waiting for a kernel spinlock though,
> > would it?  Doing two passes (or looking for a "best" candidate that
> > prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
> > seems like the best choice overall.
>
> How about something like this:

Sorry, it should be the code below:

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6b4dd95..9bc5f87 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -325,10 +325,12 @@ struct kvm_vcpu {
      * Cpu relax intercept or pause loop exit optimization
      * in_spin_loop: set when a vcpu does a pause loop exit
      *  or cpu relax intercepted.
+     * pending_interrupt: set when a vcpu waiting for an interrupt
      * dy_eligible: indicates whether vcpu is eligible for directed yield.
      */
     struct {
         bool in_spin_loop;
+        bool pending_interrupt;
         bool dy_eligible;
     } spin_loop;
 #endif
@@ -1427,6 +1429,12 @@ static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
 {
     vcpu->spin_loop.in_spin_loop = val;
 }
+
+static inline void kvm_vcpu_set_pending_interrupt(struct kvm_vcpu *vcpu, bool val)
+{
+    vcpu->spin_loop.pending_interrupt = val;
+}
+
 static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
     vcpu->spin_loop.dy_eligible = val;
@@ -1438,6 +1446,10 @@ static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
 {
 }

+static inline void kvm_vcpu_set_pending_interrupt(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
 static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 529cff1..bf6f1ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -410,6 +410,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
     INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);

     kvm_vcpu_set_in_spin_loop(vcpu, false);
+    kvm_vcpu_set_pending_interrupt(vcpu, false);
     kvm_vcpu_set_dy_eligible(vcpu, false);
     vcpu->preempted = false;
     vcpu->ready = false;
@@ -3079,14 +3080,17 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
  * Helper that checks whether a VCPU is eligible for directed yield.
  * Most eligible candidate to yield is decided by following heuristics:
  *
- *  (a) VCPU which has not done pl-exit or cpu relax intercepted recently
- *  (preempted lock holder), indicated by @in_spin_loop.
- *  Set at the beginning and cleared at the end of interception/PLE handler.
+ *  (a) VCPU which has not done pl-exit or cpu relax intercepted and is not
+ *  waiting for an interrupt recently (preempted lock holder). The former
+ *  one is indicated by @in_spin_loop, set at the beginning and cleared at
+ *  the end of interception/PLE handler. The later one is indicated by
+ *  @pending_interrupt, set when interrupt is delivering and cleared at
+ *  the end of directed yield.
  *
- *  (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
- *  chance last time (mostly it has become eligible now since we have probably
- *  yielded to lockholder in last iteration. This is done by toggling
- *  @dy_eligible each time a VCPU checked for eligibility.)
+ *  (b) VCPU which has done pl-exit/ cpu relax intercepted or is waiting for
+ *  interrupt but did not get chance last time (mostly it has become eligible
+ *  now since we have probably yielded to lockholder in last iteration. This
+ *  is done by toggling @dy_eligible each time a VCPU checked for eligibility.)
  *
  *  Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
  *  to preempted lock-holder could result in wrong VCPU selection and CPU
@@ -3102,10 +3106,10 @@ static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
     bool eligible;

-    eligible = !vcpu->spin_loop.in_spin_loop ||
+    eligible = !(vcpu->spin_loop.in_spin_loop || vcpu->spin_loop.pending_interrupt) ||
             vcpu->spin_loop.dy_eligible;

-    if (vcpu->spin_loop.in_spin_loop)
+    if (vcpu->spin_loop.in_spin_loop || vcpu->spin_loop.pending_interrupt)
         kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);

     return eligible;
@@ -3137,6 +3141,16 @@ static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
     return false;
 }

+static bool kvm_has_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+    if (vcpu_dy_runnable(vcpu)) {
+        kvm_vcpu_set_pending_interrupt(vcpu, true);
+        return true;
+    }
+
+    return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
     struct kvm *kvm = me->kvm;
@@ -3170,6 +3184,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
                 !vcpu_dy_runnable(vcpu))
                 continue;
             if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+                !kvm_has_interrupt_delivery(vcpu) &&
                 !kvm_arch_vcpu_in_kernel(vcpu))
                 continue;
             if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
@@ -3177,6 +3192,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)

             yielded = kvm_vcpu_yield_to(vcpu);
             if (yielded > 0) {
+                kvm_vcpu_set_pending_interrupt(vcpu, false);
                 kvm->last_boosted_vcpu = i;
                 break;
             } else if (yielded < 0) {
Paolo Bonzini April 20, 2021, 7:22 a.m. UTC | #7
On 20/04/21 08:08, Wanpeng Li wrote:
> On Tue, 20 Apr 2021 at 14:02, Wanpeng Li <kernellwp@gmail.com> wrote:
>>
>> On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>> On 19/04/21 18:32, Sean Christopherson wrote:
>>>> If false positives are a big concern, what about adding another pass to the loop
>>>> and only yielding to usermode vCPUs with interrupts in the second full pass?
>>>> I.e. give vCPUs that are already in kernel mode priority, and only yield to
>>>> handle an interrupt if there are no vCPUs in kernel mode.
>>>>
>>>> kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.
>>>
>>> pv_unhalted won't help if you're waiting for a kernel spinlock though,
>>> would it?  Doing two passes (or looking for a "best" candidate that
>>> prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
>>> seems like the best choice overall.
>>
>> How about something like this:

I was thinking of something simpler:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..455c648f9adc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
  {
  	struct kvm *kvm = me->kvm;
  	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
  	int yielded = 0;
  	int try = 3;
-	int pass;
+	int pass, num_passes = 1;
  	int i;
  
  	kvm_vcpu_set_in_spin_loop(me, true);
@@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
  	 * VCPU is holding the lock that we need and will release it.
  	 * We approximate round-robin by starting at the last boosted VCPU.
  	 */
-	for (pass = 0; pass < 2 && !yielded && try; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
+	for (pass = 0; pass < num_passes; pass++) {
+		int idx = me->kvm->last_boosted_vcpu;
+		int n = atomic_read(&kvm->online_vcpus);
+		for (i = 0; i < n; i++, idx++) {
+			if (idx == n)
+				idx = 0;
+
+			vcpu = kvm_get_vcpu(kvm, idx);
  			if (!READ_ONCE(vcpu->ready))
  				continue;
  			if (vcpu == me)
@@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
  			if (rcuwait_active(&vcpu->wait) &&
  			    !vcpu_dy_runnable(vcpu))
  				continue;
-			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-				!kvm_arch_vcpu_in_kernel(vcpu))
-				continue;
  			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
  				continue;
  
+			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+			    !kvm_arch_vcpu_in_kernel(vcpu)) {
+			    /*
+			     * A vCPU running in userspace can get to kernel mode via
+			     * an interrupt.  That's a worse choice than a CPU already
+			     * in kernel mode so only do it on a second pass.
+			     */
+			    if (!vcpu_dy_runnable(vcpu))
+				    continue;
+			    if (pass == 0) {
+				    num_passes = 2;
+				    continue;
+			    }
+			}
+
  			yielded = kvm_vcpu_yield_to(vcpu);
  			if (yielded > 0) {
  				kvm->last_boosted_vcpu = i;
-				break;
+				goto done;
  			} else if (yielded < 0) {
  				try--;
  				if (!try)
-					break;
+					goto done;
  			}
  		}
  	}
+done:
  	kvm_vcpu_set_in_spin_loop(me, false);
  
  	/* Ensure vcpu is not eligible during next spinloop */

Paolo
Wanpeng Li April 20, 2021, 8:48 a.m. UTC | #8
On Tue, 20 Apr 2021 at 15:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 20/04/21 08:08, Wanpeng Li wrote:
> > On Tue, 20 Apr 2021 at 14:02, Wanpeng Li <kernellwp@gmail.com> wrote:
> >>
> >> On Tue, 20 Apr 2021 at 00:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>
> >>> On 19/04/21 18:32, Sean Christopherson wrote:
> >>>> If false positives are a big concern, what about adding another pass to the loop
> >>>> and only yielding to usermode vCPUs with interrupts in the second full pass?
> >>>> I.e. give vCPUs that are already in kernel mode priority, and only yield to
> >>>> handle an interrupt if there are no vCPUs in kernel mode.
> >>>>
> >>>> kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.
> >>>
> >>> pv_unhalted won't help if you're waiting for a kernel spinlock though,
> >>> would it?  Doing two passes (or looking for a "best" candidate that
> >>> prefers kernel mode vCPUs to user mode vCPUs waiting for an interrupt)
> >>> seems like the best choice overall.
> >>
> >> How about something like this:
>
> I was thinking of something simpler:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b8e30dd5b9b..455c648f9adc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>   {
>         struct kvm *kvm = me->kvm;
>         struct kvm_vcpu *vcpu;
> -       int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>         int yielded = 0;
>         int try = 3;
> -       int pass;
> +       int pass, num_passes = 1;
>         int i;
>
>         kvm_vcpu_set_in_spin_loop(me, true);
> @@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>          * VCPU is holding the lock that we need and will release it.
>          * We approximate round-robin by starting at the last boosted VCPU.
>          */
> -       for (pass = 0; pass < 2 && !yielded && try; pass++) {
> -               kvm_for_each_vcpu(i, vcpu, kvm) {
> -                       if (!pass && i <= last_boosted_vcpu) {
> -                               i = last_boosted_vcpu;
> -                               continue;
> -                       } else if (pass && i > last_boosted_vcpu)
> -                               break;
> +       for (pass = 0; pass < num_passes; pass++) {
> +               int idx = me->kvm->last_boosted_vcpu;
> +               int n = atomic_read(&kvm->online_vcpus);
> +               for (i = 0; i < n; i++, idx++) {
> +                       if (idx == n)
> +                               idx = 0;
> +
> +                       vcpu = kvm_get_vcpu(kvm, idx);
>                         if (!READ_ONCE(vcpu->ready))
>                                 continue;
>                         if (vcpu == me)
> @@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>                         if (rcuwait_active(&vcpu->wait) &&
>                             !vcpu_dy_runnable(vcpu))
>                                 continue;
> -                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> -                               !kvm_arch_vcpu_in_kernel(vcpu))
> -                               continue;
>                         if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>                                 continue;
>
> +                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> +                           !kvm_arch_vcpu_in_kernel(vcpu)) {
> +                           /*
> +                            * A vCPU running in userspace can get to kernel mode via
> +                            * an interrupt.  That's a worse choice than a CPU already
> +                            * in kernel mode so only do it on a second pass.
> +                            */
> +                           if (!vcpu_dy_runnable(vcpu))
> +                                   continue;
> +                           if (pass == 0) {
> +                                   num_passes = 2;
> +                                   continue;
> +                           }
> +                       }
> +
>                         yielded = kvm_vcpu_yield_to(vcpu);
>                         if (yielded > 0) {
>                                 kvm->last_boosted_vcpu = i;
> -                               break;
> +                               goto done;
>                         } else if (yielded < 0) {
>                                 try--;
>                                 if (!try)
> -                                       break;
> +                                       goto done;
>                         }
>                 }
>         }
> +done:

We just tested the above patch against a 96-vCPU VM in an over-subscribed
scenario, and the pbzip2 score fluctuated drastically. Sometimes it is
worse than vanilla, but the average improvement is around 2.2%. The
new version of my patch gives around 9.3%, and the originally posted patch
around 10%, which is as expected since now both user-mode IPI receivers
and lock waiters are second-class citizens. A big VM increases the
probability that multiple vCPUs enter the PLE handler at the same time:
a vCPU that starts searching earlier can mark user-mode IPI receivers as
dy_eligible, and a vCPU that starts searching a little later can select
them directly. With the above posting, however, each PLE-exiting vCPU
has to do the second full pass by itself.

    Wanpeng
Paolo Bonzini April 20, 2021, 10:23 a.m. UTC | #9
On 20/04/21 10:48, Wanpeng Li wrote:
>> I was thinking of something simpler:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 9b8e30dd5b9b..455c648f9adc 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>>    {
>>          struct kvm *kvm = me->kvm;
>>          struct kvm_vcpu *vcpu;
>> -       int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>>          int yielded = 0;
>>          int try = 3;
>> -       int pass;
>> +       int pass, num_passes = 1;
>>          int i;
>>
>>          kvm_vcpu_set_in_spin_loop(me, true);
>> @@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>>           * VCPU is holding the lock that we need and will release it.
>>           * We approximate round-robin by starting at the last boosted VCPU.
>>           */
>> -       for (pass = 0; pass < 2 && !yielded && try; pass++) {
>> -               kvm_for_each_vcpu(i, vcpu, kvm) {
>> -                       if (!pass && i <= last_boosted_vcpu) {
>> -                               i = last_boosted_vcpu;
>> -                               continue;
>> -                       } else if (pass && i > last_boosted_vcpu)
>> -                               break;
>> +       for (pass = 0; pass < num_passes; pass++) {
>> +               int idx = me->kvm->last_boosted_vcpu;
>> +               int n = atomic_read(&kvm->online_vcpus);
>> +               for (i = 0; i < n; i++, idx++) {
>> +                       if (idx == n)
>> +                               idx = 0;
>> +
>> +                       vcpu = kvm_get_vcpu(kvm, idx);
>>                          if (!READ_ONCE(vcpu->ready))
>>                                  continue;
>>                          if (vcpu == me)
>> @@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>>                          if (rcuwait_active(&vcpu->wait) &&
>>                              !vcpu_dy_runnable(vcpu))
>>                                  continue;
>> -                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
>> -                               !kvm_arch_vcpu_in_kernel(vcpu))
>> -                               continue;
>>                          if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>>                                  continue;
>>
>> +                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
>> +                           !kvm_arch_vcpu_in_kernel(vcpu)) {
>> +                           /*
>> +                            * A vCPU running in userspace can get to kernel mode via
>> +                            * an interrupt.  That's a worse choice than a CPU already
>> +                            * in kernel mode so only do it on a second pass.
>> +                            */
>> +                           if (!vcpu_dy_runnable(vcpu))
>> +                                   continue;
>> +                           if (pass == 0) {
>> +                                   num_passes = 2;
>> +                                   continue;
>> +                           }
>> +                       }
>> +
>>                          yielded = kvm_vcpu_yield_to(vcpu);
>>                          if (yielded > 0) {
>>                                  kvm->last_boosted_vcpu = i;
>> -                               break;
>> +                               goto done;
>>                          } else if (yielded < 0) {
>>                                  try--;
>>                                  if (!try)
>> -                                       break;
>> +                                       goto done;
>>                          }
>>                  }
>>          }
>> +done:
> 
> We just tested the above patch against a 96-vCPU VM in an over-subscribed
> scenario, and the pbzip2 score fluctuated drastically. Sometimes it is
> worse than vanilla, but the average improvement is around 2.2%. The
> new version of my patch gives around 9.3%, and the originally posted patch
> around 10%, which is as expected since now both user-mode IPI receivers
> and lock waiters are second-class citizens.

Fair enough.  Of the two patches you posted I prefer the original, so 
I'll go with that one.

Paolo
Wanpeng Li April 20, 2021, 10:27 a.m. UTC | #10
On Tue, 20 Apr 2021 at 18:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 20/04/21 10:48, Wanpeng Li wrote:
> >> I was thinking of something simpler:
> >>
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 9b8e30dd5b9b..455c648f9adc 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -3198,10 +3198,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >>    {
> >>          struct kvm *kvm = me->kvm;
> >>          struct kvm_vcpu *vcpu;
> >> -       int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> >>          int yielded = 0;
> >>          int try = 3;
> >> -       int pass;
> >> +       int pass, num_passes = 1;
> >>          int i;
> >>
> >>          kvm_vcpu_set_in_spin_loop(me, true);
> >> @@ -3212,13 +3211,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >>           * VCPU is holding the lock that we need and will release it.
> >>           * We approximate round-robin by starting at the last boosted VCPU.
> >>           */
> >> -       for (pass = 0; pass < 2 && !yielded && try; pass++) {
> >> -               kvm_for_each_vcpu(i, vcpu, kvm) {
> >> -                       if (!pass && i <= last_boosted_vcpu) {
> >> -                               i = last_boosted_vcpu;
> >> -                               continue;
> >> -                       } else if (pass && i > last_boosted_vcpu)
> >> -                               break;
> >> +       for (pass = 0; pass < num_passes; pass++) {
> >> +               int idx = me->kvm->last_boosted_vcpu;
> >> +               int n = atomic_read(&kvm->online_vcpus);
> >> +               for (i = 0; i < n; i++, idx++) {
> >> +                       if (idx == n)
> >> +                               idx = 0;
> >> +
> >> +                       vcpu = kvm_get_vcpu(kvm, idx);
> >>                          if (!READ_ONCE(vcpu->ready))
> >>                                  continue;
> >>                          if (vcpu == me)
> >> @@ -3226,23 +3226,36 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >>                          if (rcuwait_active(&vcpu->wait) &&
> >>                              !vcpu_dy_runnable(vcpu))
> >>                                  continue;
> >> -                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> >> -                               !kvm_arch_vcpu_in_kernel(vcpu))
> >> -                               continue;
> >>                          if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> >>                                  continue;
> >>
> >> +                       if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> >> +                           !kvm_arch_vcpu_in_kernel(vcpu)) {
> >> +                           /*
> >> +                            * A vCPU running in userspace can get to kernel mode via
> >> +                            * an interrupt.  That's a worse choice than a CPU already
> >> +                            * in kernel mode so only do it on a second pass.
> >> +                            */
> >> +                           if (!vcpu_dy_runnable(vcpu))
> >> +                                   continue;
> >> +                           if (pass == 0) {
> >> +                                   num_passes = 2;
> >> +                                   continue;
> >> +                           }
> >> +                       }
> >> +
> >>                          yielded = kvm_vcpu_yield_to(vcpu);
> >>                          if (yielded > 0) {
> >>                                  kvm->last_boosted_vcpu = i;
> >> -                               break;
> >> +                               goto done;
> >>                          } else if (yielded < 0) {
> >>                                  try--;
> >>                                  if (!try)
> >> -                                       break;
> >> +                                       goto done;
> >>                          }
> >>                  }
> >>          }
> >> +done:
> >
> > We just tested the above patch against a 96-vCPU VM in an over-subscribed
> > scenario, and the pbzip2 score fluctuated drastically. Sometimes it is
> > worse than vanilla, but the average improvement is around 2.2%. The
> > new version of my patch gives around 9.3%, and the originally posted patch
> > around 10%, which is as expected since now both user-mode IPI receivers
> > and lock waiters are second-class citizens.
>
> Fair enough.  Of the two patches you posted I prefer the original, so
> I'll go with that one.

Great! Thanks. :)

    Wanpeng

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d2dd3f..0f16fa5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11069,6 +11069,14 @@  bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.apicv_active && static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
+		return true;
+
+	return false;
+}
+
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.preempted_in_kernel;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b06d12..5012fc4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -954,6 +954,7 @@  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
+bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0a481e7..781d2db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3012,6 +3012,11 @@  static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+bool __weak kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	struct kvm *kvm = me->kvm;
@@ -3045,6 +3050,7 @@  void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			    !vcpu_dy_runnable(vcpu))
 				continue;
 			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+				!kvm_arch_interrupt_delivery(vcpu) &&
 				!kvm_arch_vcpu_in_kernel(vcpu))
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
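
A note on the wiring above (a sketch of the pattern, not part of the patch):
the __weak stub in virt/kvm/kvm_main.c keeps the old behaviour for every
architecture that does not implement the hook, while the strong definition in
arch/x86/kvm/x86.c overrides it for x86. Another architecture could opt in
later simply by providing its own strong definition, for example:

	/* hypothetical example -- arch_pending_guest_irq() is a made-up helper */
	bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
	{
		return arch_pending_guest_irq(vcpu);
	}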