
[v4] KVM: halt-polling: poll for the upcoming fire timers

Message ID 1464076674-4024-1-git-send-email-wanpeng.li@hotmail.com (mailing list archive)
State New, archived

Commit Message

Wanpeng Li May 24, 2016, 7:57 a.m. UTC
From: Wanpeng Li <wanpeng.li@hotmail.com>

If an emulated lapic timer will fire soon (within 10us, the base of
dynamic halt-polling, since the lower end of message-passing workload
latency such as TCP_RR polls for less than 10us), we can treat it as a
short halt and poll until it fires. The expiry callback apic_timer_fn()
sets KVM_REQ_PENDING_TIMER, and that flag is checked during the busy
poll. This avoids the context-switch overhead and the latency of waking
the vCPU back up.

This feature is slightly different from the current advance-expiration
mechanism. Advance expiration relies on the vCPU running (it polls
before vmentry), but in some cases the timer interrupt may be blocked
by another thread (i.e., the IF bit is clear) and the vCPU cannot be
scheduled to run immediately, so even with the timer advanced the vCPU
may still see the latency. Polling is different: it ensures that the
vCPU observes the timer expiration before it is scheduled out.
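
In essence, kvm_vcpu_block() gains one extra reason to poll. A condensed
sketch of the change, reusing the existing poll loop (the full diff is at
the end of this thread):

	u64 remaining = kvm_arch_timer_remaining(vcpu);	/* -1ULL when no timer is armed */

	if (vcpu->halt_poll_ns || remaining < halt_poll_ns_timer) {
		u64 delta = vcpu->halt_poll_ns ? vcpu->halt_poll_ns : remaining;
		ktime_t stop = ktime_add_ns(ktime_get(), delta);

		do {
			/*
			 * apic_timer_fn() sets KVM_REQ_PENDING_TIMER; the
			 * check below observes it without a context switch.
			 */
			if (kvm_vcpu_check_block(vcpu) < 0)
				goto out;	/* woken while still polling */
			cur = ktime_get();
		} while (single_task_running() && ktime_before(cur, stop));
	}
	/* otherwise fall through to the schedule()-based halt */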

For the benchmark below, HRTICK was enabled in the dynticks guests via:
echo HRTICK > /sys/kernel/debug/sched_features

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel     Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.4300    13.0    10.2 vanilla
kernel     Linux 4.6.0+   15.3   13.6   10.7   12.5 9.0000    12.8 7.38000 poll

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v3 -> v4:
 * add module parameter halt_poll_ns_timer
 * rename the patch subject, since the lapic may be specific to x86.
v2 -> v3:
 * add Yang's statement to patch description
v1 -> v2:
 * add return statements for non-x86 archs
 * handle the never-expires case on x86 (hrtimer not started)

 arch/arm/include/asm/kvm_host.h     |  4 ++++
 arch/arm64/include/asm/kvm_host.h   |  4 ++++
 arch/mips/include/asm/kvm_host.h    |  4 ++++
 arch/powerpc/include/asm/kvm_host.h |  4 ++++
 arch/s390/include/asm/kvm_host.h    |  4 ++++
 arch/x86/kvm/lapic.c                | 11 +++++++++++
 arch/x86/kvm/lapic.h                |  1 +
 arch/x86/kvm/x86.c                  |  5 +++++
 include/linux/kvm_host.h            |  1 +
 virt/kvm/kvm_main.c                 | 15 +++++++++++----
 10 files changed, 49 insertions(+), 4 deletions(-)

Comments

Christian Borntraeger May 24, 2016, 5:15 p.m. UTC | #1
On 05/24/2016 09:57 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> [...]

> @@ -1966,7 +1970,7 @@ static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>  	grow = READ_ONCE(halt_poll_ns_grow);
>  	/* 10us base */
>  	if (val == 0 && grow)
> -		val = 10000;
> +		val = halt_poll_ns_timer;

Drop this hunk and leave this at 10000, so that a user can disable the timer 
logic, but keep the old polling?


>  	else
>  		val *= grow;
> 
> 
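
That is, halt_poll_ns_timer would then only gate the new "timer fires
soon" check, while grow_halt_poll_ns() keeps its hard-coded base,
roughly:

	/* 10us base */
	if (val == 0 && grow)
		val = 10000;	/* adaptive polling keeps its fixed base */

Setting halt_poll_ns_timer=0 would then never satisfy
"remaining < halt_poll_ns_timer", disabling the timer-driven polling
without affecting the old adaptive polling.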

David Matlack May 24, 2016, 10:38 p.m. UTC | #2
On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> [...]
>
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------------------
> Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>                          ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
> --------- ------------- ------ ------ ------ ------ ------ ------- -------
> kernel     Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.4300    13.0    10.2 vanilla
> kernel     Linux 4.6.0+   15.3   13.6   10.7   12.5 9.0000    12.8 7.38000 poll

These results aren't very compelling. Sometimes polling is faster,
sometimes vanilla is faster, sometimes they are about the same.

I imagine there are hyper sensitive workloads which cannot tolerate a
long tail in timer latency (e.g. realtime workloads). I would expect a
patch like this to provide a "smoothing effect", reducing that tail.
But for cloud/server workloads, I would not expect any sensitivity to
jitter in timer latency (especially while the VCPU is halted).

Note that while halt-polling happens when the CPU is idle, it's still
not free. It constricts the scheduler's cpu load balancer, because the
CPU appears to be busy. In KVM's default configuration, I'd prefer to
only add more polling when the gain is clear. If there are guest
workloads that want this patch, I'd suggest polling for timers be
default-off. At minimum, there should be a module parameter to control
it (like Christian Borntraeger suggested).
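
For example, since halt_poll_ns_timer is registered with module_param()
in kvm_main.c, it should be reachable like the other halt-polling knobs
(a sketch, assuming the usual sysfs path for the kvm module):

	# disable polling for timers, keep adaptive halt-polling
	echo 0 > /sys/module/kvm/parameters/halt_poll_ns_timer
	# or set it at module load time
	modprobe kvm halt_poll_ns_timer=0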

> [...]
Wanpeng Li May 24, 2016, 11:11 p.m. UTC | #3
2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@google.com>:
> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
>> [...]
>>
>> Context switching - times in microseconds - smaller is better
>> -------------------------------------------------------------------------
>> Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>>                          ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
>> --------- ------------- ------ ------ ------ ------ ------ ------- -------
>> kernel     Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.4300    13.0    10.2 vanilla
>> kernel     Linux 4.6.0+   15.3   13.6   10.7   12.5 9.0000    12.8 7.38000 poll
>
> These results aren't very compelling. Sometimes polling is faster,
> sometimes vanilla is faster, sometimes they are about the same.

Workloads with more processes and bigger cache footprints benefit more,
since I enabled HRTICK so that preemption in the guest is driven by a
high-resolution timer. I actually tried to emulate Yang's workload,
https://lkml.org/lkml/2016/5/22/162, and his real workload gets even
more benefit, as he mentioned in https://lkml.org/lkml/2016/5/19/667.

> I imagine there are hyper sensitive workloads which cannot tolerate a
> long tail in timer latency (e.g. realtime workloads). I would expect a
> patch like this to provide a "smoothing effect", reducing that tail.
> But for cloud/server workloads, I would not expect any sensitivity to
> jitter in timer latency (especially while the VCPU is halted).

Yang's is a real cloud workload.

>
> Note that while halt-polling happens when the CPU is idle, it's still
> not free. It constricts the scheduler's cpu load balancer, because the
> CPU appears to be busy. In KVM's default configuration, I'd prefer to
> only add more polling when the gain is clear. If there are guest
> workloads that want this patch, I'd suggest polling for timers be
> default-off. At minimum, there should be a module parameter to control
> it (like Christian Borntraeger suggested).

Yeah, I will add a module parameter so the feature can be enabled/disabled.

Regards,
Wanpeng Li
Wanpeng Li May 24, 2016, 11:16 p.m. UTC | #4
2016-05-25 1:15 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> On 05/24/2016 09:57 AM, Wanpeng Li wrote:
>> [...]
>
> @@ -1966,7 +1970,7 @@ static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>>       grow = READ_ONCE(halt_poll_ns_grow);
>>       /* 10us base */
>>       if (val == 0 && grow)
>> -             val = 10000;
>> +             val = halt_poll_ns_timer;
>
> Drop this hunk and leave this at 10000, so that a user can disable the timer
> logic, but keep the old polling?

I see, sorry for misunderstanding you in v3.

Regards,
Wanpeng Li
David Matlack May 24, 2016, 11:37 p.m. UTC | #5
On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@google.com>:
>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
>>> [...]
>>>
>>> Context switching - times in microseconds - smaller is better
>>> -------------------------------------------------------------------------
>>> Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>>>                          ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
>>> --------- ------------- ------ ------ ------ ------ ------ ------- -------
>>> kernel     Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.4300    13.0    10.2 vanilla
>>> kernel     Linux 4.6.0+   15.3   13.6   10.7   12.5 9.0000    12.8 7.38000 poll
>>
>> These results aren't very compelling. Sometimes polling is faster,
>> sometimes vanilla is faster, sometimes they are about the same.
>
> More processes and bigger cache footprints can get benefit from the
> result since I open the hrtimer for the precision preemption.

The VCPU is halted (idle), so the timer interrupt is not preempting
anything. Also I would not expect any preemption in a context
switching benchmark, the threads should be handing off execution to
one another.

I'm confused why timers would play any role in the performance of this
benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K
runs?

> Actually
> I try to emulate Yang's workload, https://lkml.org/lkml/2016/5/22/162.
> And his real workload can get more benefit as he mentioned,
> https://lkml.org/lkml/2016/5/19/667.
>
>> I imagine there are hyper sensitive workloads which cannot tolerate a
>> long tail in timer latency (e.g. realtime workloads). I would expect a
>> patch like this to provide a "smoothing effect", reducing that tail.
>> But for cloud/server workloads, I would not expect any sensitivity to
>> jitter in timer latency (especially while the VCPU is halted).
>
> Yang's is real cloud workload.

I have 2 issues with optimizing for Yang's workload. Yang, please
correct me if I am mis-characterizing it.
1. The delay in timer interrupts is caused by something disabling the
interrupts on the CPU for more than a millisecond. It seems that is
the real issue. I'm wary of using polling as a workaround.
2. The delay is caused by a separate task. Halt-polling would not help
in that scenario, it would yield the CPU to that task.

>
>>
>> Note that while halt-polling happens when the CPU is idle, it's still
>> not free. It constricts the scheduler's cpu load balancer, because the
>> CPU appears to be busy. In KVM's default configuration, I'd prefer to
>> only add more polling when the gain is clear. If there are guest
>> workloads that want this patch, I'd suggest polling for timers be
>> default-off. At minimum, there should be a module parameter to control
>> it (like Christian Borntraeger suggested).
>
> Yeah, I will add the module parameter in order to enable/disable.
>
> Regards,
> Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wanpeng Li May 25, 2016, 12:47 a.m. UTC | #6
2016-05-25 7:37 GMT+08:00 David Matlack <dmatlack@google.com>:
> On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
>> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@google.com>:
>>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
>>>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>>>
>>>> [...]
>>>>
>>>> echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.


                    ^^^^^^^^^^^^^^^^^

>>>>
>>>> [...]
>>>
>>> These results aren't very compelling. Sometimes polling is faster,
>>> sometimes vanilla is faster, sometimes they are about the same.
>>
>> More processes and bigger cache footprints can get benefit from the
>> result since I open the hrtimer for the precision preemption.
>
> The VCPU is halted (idle), so the timer interrupt is not preempting
> anything. Also I would not expect any preemption in a context
> switching benchmark, the threads should be handing off execution to
> one another.
>
> I'm confused why timers would play any role in the performance of this
> benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K
> runs?

See https://lwn.net/Articles/254512/. I enabled HRTICK, the
high-resolution preemption tick, in the dynticks guests rather than on
the host, so task switches in the guests are triggered by hrtimer expiry.
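
For reference, a sketch of that setup inside the guest (sched_features
is the scheduler's debugfs knob; HRTICK is off by default):

	# inside the dynticks guest
	cat /sys/kernel/debug/sched_features             # lists NO_HRTICK by default
	echo HRTICK > /sys/kernel/debug/sched_features   # drive preemption from hrtimers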

Regards,
Wanpeng Li
Wanpeng Li May 25, 2016, 1:29 a.m. UTC | #7
2016-05-25 8:47 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-05-25 7:37 GMT+08:00 David Matlack <dmatlack@google.com>:
>> On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
>>> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@google.com>:
>>>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
>>>>> [...]
>>>>
>>>> These results aren't very compelling. Sometimes polling is faster,
>>>> sometimes vanilla is faster, sometimes they are about the same.

Pinning the vCPUs shows a bigger difference.

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel     Linux 4.6.0+   11.6   14.0   11.8   53.1   12.5 8.16000    11.4 vanilla
kernel     Linux 4.6.0+   45.8   15.1 2.3000   12.9 1.4200    14.6 4.52000 poll

Regards,
Wanpeng Li
Yang Zhang May 25, 2016, 2:10 a.m. UTC | #8
On 2016/5/25 7:37, David Matlack wrote:
> On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
>> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@google.com>:
>>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@gmail.com> wrote:
> [...]
>
> I have 2 issues with optimizing for Yang's workload. Yang, please
> correct me if I am mis-characterizing it.
> 1. The delay in timer interrupts is caused by something disabling the
> interrupts on the CPU for more than a millisecond. It seems that is
> the real issue. I'm wary of using polling as a workaround.

Yes, this is the most likely case.

> 2. The delay is caused by a separate task. Halt-polling would not help
> in that scenario, it would yield the CPU to that task.

In some cases the separate task is migrated in from another CPU after
this CPU enters the idle state, so halt-polling may still help. The
delay is then caused by two context switches (scheduling the vCPU out,
then migrating the vCPU to another idle CPU).


Patch

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 0df6b1f..fdfbed9 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -292,6 +292,10 @@  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
+static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return -1ULL;
+}
 
 static inline void kvm_arm_init_debug(void) {}
 static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e63d23b..f510d71 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -371,6 +371,10 @@  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
+static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return -1ULL;
+}
 
 void kvm_arm_init_debug(void);
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 6733ac5..baf9472 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -814,6 +814,10 @@  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return -1ULL;
+}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
 #endif /* __MIPS_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ec35af3..5986c79 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -729,5 +729,9 @@  static inline void kvm_arch_exit(void) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
+static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return -1ULL;
+}
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 37b9017..bdb01a1 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -696,6 +696,10 @@  static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return -1ULL;
+}
 
 void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bbb5b28..cfeeac3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -256,6 +256,17 @@  static inline int apic_lvtt_tscdeadline(struct kvm_lapic *apic)
 	return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_TSCDEADLINE;
 }
 
+u64 apic_get_timer_expire(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	struct hrtimer *timer = &apic->lapic_timer.timer;
+
+	if (!hrtimer_active(timer))
+		return -1ULL;
+	else
+		return ktime_to_ns(hrtimer_get_remaining(timer));
+}
+
 static inline int apic_lvt_nmi_mode(u32 lvt_val)
 {
 	return (lvt_val & (APIC_MODE_MASK | APIC_LVT_MASKED)) == APIC_DM_NMI;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 891c6da..ee4da6c 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -212,4 +212,5 @@  bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			struct kvm_vcpu **dest_vcpu);
 int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
 			const unsigned long *bitmap, u32 bitmap_size);
+u64 apic_get_timer_expire(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c805cf4..1b89a68 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7623,6 +7623,11 @@  bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
 struct static_key kvm_no_apic_vcpu __read_mostly;
 EXPORT_SYMBOL_GPL(kvm_no_apic_vcpu);
 
+u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
+{
+	return apic_get_timer_expire(vcpu);
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b1fa8f1..14d6c23 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -663,6 +663,7 @@  int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
+u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu);
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd4ac9d..afd15ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -78,6 +78,10 @@  module_param(halt_poll_ns_grow, uint, S_IRUGO | S_IWUSR);
 static unsigned int halt_poll_ns_shrink;
 module_param(halt_poll_ns_shrink, uint, S_IRUGO | S_IWUSR);
 
+/* lower-end of message passing workload latency TCP_RR's poll time < 10us */
+static unsigned int halt_poll_ns_timer = 10000;
+module_param(halt_poll_ns_timer, uint, S_IRUGO | S_IWUSR);
+
 /*
  * Ordering of locks:
  *
@@ -1966,7 +1970,7 @@  static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
 	grow = READ_ONCE(halt_poll_ns_grow);
 	/* 10us base */
 	if (val == 0 && grow)
-		val = 10000;
+		val = halt_poll_ns_timer;
 	else
 		val *= grow;
 
@@ -2014,12 +2018,15 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	ktime_t start, cur;
 	DECLARE_SWAITQUEUE(wait);
 	bool waited = false;
-	u64 block_ns;
+	u64 block_ns, delta, remaining;
 
+	remaining = kvm_arch_timer_remaining(vcpu);
 	start = cur = ktime_get();
-	if (vcpu->halt_poll_ns) {
-		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
+	if (vcpu->halt_poll_ns || remaining < halt_poll_ns_timer) {
+		ktime_t stop;
 
+		delta = vcpu->halt_poll_ns ? vcpu->halt_poll_ns : remaining;
+		stop = ktime_add_ns(ktime_get(), delta);
 		++vcpu->stat.halt_attempted_poll;
 		do {
 			/*