| Message ID | 1648872113-24329-1-git-send-email-lirongqing@baidu.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: VMX: optimize pi_wakeup_handler |
On 4/2/22 06:01, Li RongQing wrote:
> pi_wakeup_handler is used to wakeup the sleep vCPUs by posted irq
> list_for_each_entry is used in it, and whose input is other function
> per_cpu(), That cause that per_cpu() be invoked at least twice when
> there is one sleep vCPU
>
> so optimize pi_wakeup_handler it by reading once which is safe in
> spinlock protection
>
> and same to per CPU spinlock

What's the difference in the generated code?

Paolo

> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 5fdabf3..0dae431 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -214,17 +214,21 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>   */
>  void pi_wakeup_handler(void)
>  {
> +	struct list_head *wakeup_list;
>  	int cpu = smp_processor_id();
> +	raw_spinlock_t *spinlock;
>  	struct vcpu_vmx *vmx;
>
> -	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
> -	list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
> -			    pi_wakeup_list) {
> +	spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
> +
> +	raw_spin_lock(spinlock);
> +	wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
> +	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
>
>  		if (pi_test_on(&vmx->pi_desc))
>  			kvm_vcpu_wake_up(&vmx->vcpu);
>  	}
> -	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
> +	raw_spin_unlock(spinlock);
>  }
>
>  void __init pi_init_cpu(int cpu)
> -----Original Message-----
> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> Sent: April 2, 2022 16:14
> To: Li,Rongqing <lirongqing@baidu.com>; kvm@vger.kernel.org;
> seanjc@google.com; vkuznets@redhat.com
> Subject: Re: [PATCH] KVM: VMX: optimize pi_wakeup_handler
>
> On 4/2/22 06:01, Li RongQing wrote:
> > pi_wakeup_handler is used to wakeup the sleep vCPUs by posted irq
> > list_for_each_entry is used in it, and whose input is other function
> > per_cpu(), That cause that per_cpu() be invoked at least twice when
> > there is one sleep vCPU
> >
> > so optimize pi_wakeup_handler it by reading once which is safe in
> > spinlock protection
> >
> > and same to per CPU spinlock
>
> What's the difference in the generated code?

This reduces the generated code by about one fifth.

Without this patch:

0000000000000400 <pi_wakeup_handler>:
 400:	e8 00 00 00 00       	callq  405 <pi_wakeup_handler+0x5>
 405:	55                   	push   %rbp
 406:	48 89 e5             	mov    %rsp,%rbp
 409:	41 57                	push   %r15
 40b:	41 56                	push   %r14
 40d:	41 55                	push   %r13
 40f:	41 54                	push   %r12
 411:	49 c7 c4 00 00 00 00 	mov    $0x0,%r12
 418:	65 44 8b 2d 00 00 00 	mov    %gs:0x0(%rip),%r13d        # 420 <pi_wakeup_handler+0x20>
 41f:	00
 420:	4d 63 ed             	movslq %r13d,%r13
 423:	4c 89 e7             	mov    %r12,%rdi
 426:	53                   	push   %rbx
 427:	4a 03 3c ed 00 00 00 	add    0x0(,%r13,8),%rdi
 42e:	00
 42f:	49 c7 c6 00 00 00 00 	mov    $0x0,%r14
 436:	e8 00 00 00 00       	callq  43b <pi_wakeup_handler+0x3b>
 43b:	4a 8b 0c ed 00 00 00 	mov    0x0(,%r13,8),%rcx
 442:	00
 443:	4c 89 f0             	mov    %r14,%rax
 446:	48 8b 14 01          	mov    (%rcx,%rax,1),%rdx
 44a:	48 01 c8             	add    %rcx,%rax
 44d:	48 39 c2             	cmp    %rax,%rdx
 450:	74 3e                	je     490 <pi_wakeup_handler+0x90>
 452:	48 8d 9a 40 d9 ff ff 	lea    -0x26c0(%rdx),%rbx
 459:	49 c7 c7 00 00 00 00 	mov    $0x0,%r15
 460:	48 8b 83 a0 26 00 00 	mov    0x26a0(%rbx),%rax
 467:	a8 01                	test   $0x1,%al
 469:	74 08                	je     473 <pi_wakeup_handler+0x73>
 46b:	48 89 df             	mov    %rbx,%rdi
 46e:	e8 00 00 00 00       	callq  473 <pi_wakeup_handler+0x73>
 473:	4b 8b 0c ef          	mov    (%r15,%r13,8),%rcx
 477:	48 8b 93 c0 26 00 00 	mov    0x26c0(%rbx),%rdx
 47e:	4c 89 f0             	mov    %r14,%rax
 481:	48 01 c8             	add    %rcx,%rax
 484:	48 8d 9a 40 d9 ff ff 	lea    -0x26c0(%rdx),%rbx
 48b:	48 39 c2             	cmp    %rax,%rdx
 48e:	75 d0                	jne    460 <pi_wakeup_handler+0x60>
 490:	49 01 cc             	add    %rcx,%r12
 493:	41 c6 04 24 00       	movb   $0x0,(%r12)
 498:	5b                   	pop    %rbx
 499:	41 5c                	pop    %r12
 49b:	41 5d                	pop    %r13
 49d:	41 5e                	pop    %r14
 49f:	41 5f                	pop    %r15
 4a1:	5d                   	pop    %rbp
 4a2:	c3                   	retq
 4a3:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
 4aa:	00 00
 4ac:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
 4b3:	00 00 00
 4b6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
 4bd:	00 00 00

With this patch:

 400:	e8 00 00 00 00       	callq  405 <pi_wakeup_handler+0x5>
 405:	55                   	push   %rbp
 406:	48 89 e5             	mov    %rsp,%rbp
 409:	41 55                	push   %r13
 40b:	41 54                	push   %r12
 40d:	53                   	push   %rbx
 40e:	49 c7 c5 00 00 00 00 	mov    $0x0,%r13
 415:	49 c7 c4 00 00 00 00 	mov    $0x0,%r12
 41c:	65 8b 1d 00 00 00 00 	mov    %gs:0x0(%rip),%ebx        # 423 <pi_wakeup_handler+0x23>
 423:	48 63 db             	movslq %ebx,%rbx
 426:	4c 03 2c dd 00 00 00 	add    0x0(,%rbx,8),%r13
 42d:	00
 42e:	4c 89 ef             	mov    %r13,%rdi
 431:	e8 00 00 00 00       	callq  436 <pi_wakeup_handler+0x36>
 436:	4c 03 24 dd 00 00 00 	add    0x0(,%rbx,8),%r12
 43d:	00
 43e:	49 8b 04 24          	mov    (%r12),%rax
 442:	49 39 c4             	cmp    %rax,%r12
 445:	74 2d                	je     474 <pi_wakeup_handler+0x74>
 447:	48 8d 98 40 d9 ff ff 	lea    -0x26c0(%rax),%rbx
 44e:	48 8b 83 a0 26 00 00 	mov    0x26a0(%rbx),%rax
 455:	a8 01                	test   $0x1,%al
 457:	74 08                	je     461 <pi_wakeup_handler+0x61>
 459:	48 89 df             	mov    %rbx,%rdi
 45c:	e8 00 00 00 00       	callq  461 <pi_wakeup_handler+0x61>
 461:	48 8b 83 c0 26 00 00 	mov    0x26c0(%rbx),%rax
 468:	49 39 c4             	cmp    %rax,%r12
 46b:	48 8d 98 40 d9 ff ff 	lea    -0x26c0(%rax),%rbx
 472:	75 da                	jne    44e <pi_wakeup_handler+0x4e>
 474:	41 c6 45 00 00       	movb   $0x0,0x0(%r13)
 479:	5b                   	pop    %rbx
 47a:	41 5c                	pop    %r12
 47c:	41 5d                	pop    %r13
 47e:	5d                   	pop    %rbp
 47f:	c3                   	retq

There is a similar patch: 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7

-Li
On Sat, Apr 02, 2022, Li,Rongqing wrote:
> > From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> > On 4/2/22 06:01, Li RongQing wrote:
> > > pi_wakeup_handler is used to wakeup the sleep vCPUs by posted irq
> > > list_for_each_entry is used in it, and whose input is other function
> > > per_cpu(), That cause that per_cpu() be invoked at least twice when
> > > there is one sleep vCPU
> > >
> > > so optimize pi_wakeup_handler it by reading once which is safe in
> > > spinlock protection

There's no need to protect reading the per-cpu variable with the spinlock,
only walking the list needs to be protected.  E.g. the code can be compacted
to:

	int cpu = smp_processor_id();
	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
	struct vcpu_vmx *vmx;

	raw_spin_lock(spinlock);
	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
		if (pi_test_on(&vmx->pi_desc))
			kvm_vcpu_wake_up(&vmx->vcpu);
	}
	raw_spin_unlock(spinlock);

> > > and same to per CPU spinlock
> >
> > What's the difference in the generated code?
>
> This reduces one fifth asm codes

...

> these is a similar patch 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7

Is there a measurable performance improvement though?  I don't dislike the
patch, but it probably falls into the "technically an optimization but no one
will ever notice" category.
> There's no need to protect reading the per-cpu variable with the spinlock,
> only walking the list needs to be protected. E.g. the code can be compacted
> to:

Thanks

> > > What's the difference in the generated code?
> >
> > This reduces one fifth asm codes
>
> ...
>
> > these is a similar patch 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
>
> Is there a measurable performance improvement though? I don't dislike the
> patch, but it probably falls into the "technically an optimization but no
> one will ever notice" category.

There is a small performance improvement when "perf bench sched pipe" is run
on a CPU that supports IPI virtualization, with halt/mwait not passed through
to the VM (with
https://patchwork.kernel.org/project/kvm/patch/20220304080725.18135-9-guang.zeng@intel.com/
applied).

Thanks

-Li
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5fdabf3..0dae431 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -214,17 +214,21 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
  */
 void pi_wakeup_handler(void)
 {
+	struct list_head *wakeup_list;
 	int cpu = smp_processor_id();
+	raw_spinlock_t *spinlock;
 	struct vcpu_vmx *vmx;
 
-	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
-	list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
-			    pi_wakeup_list) {
+	spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
+
+	raw_spin_lock(spinlock);
+	wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
+	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
 
 		if (pi_test_on(&vmx->pi_desc))
 			kvm_vcpu_wake_up(&vmx->vcpu);
 	}
-	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
+	raw_spin_unlock(spinlock);
 }
 
 void __init pi_init_cpu(int cpu)
pi_wakeup_handler() wakes sleeping vCPUs via posted interrupts. It walks the
per-CPU wakeup list with list_for_each_entry(), whose head argument is the
per_cpu() expression; as a result, per_cpu() is evaluated at least twice
whenever there is a sleeping vCPU.

Optimize pi_wakeup_handler() by reading the per-CPU list head once, which is
safe under the spinlock's protection, and do the same for the per-CPU
spinlock.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)