Message ID | 20190517174857.GA8611@amt.cnet (mailing list archive)
---|---
State | New, archived
Series | x86: add cpuidle_kvm driver to allow guest side halt polling
On 17/05/19 19:48, Marcelo Tosatti wrote:
>
> The cpuidle_kvm driver allows the guest vcpus to poll for a specified
> amount of time before halting. This provides the following benefits
> over host side polling:
>
>  1) The POLL flag is set while polling is performed, which allows
>     a remote vCPU to avoid sending an IPI (and the associated
>     cost of handling the IPI) when performing a wakeup.
>
>  2) The HLT VM-exit cost can be avoided.
>
> The downside of guest side polling is that polling is performed
> even with other runnable tasks in the host.
>
> Results comparing halt_poll_ns and a server/client application
> where a small packet is ping-ponged:
>
> host                                        --> 31.33
> halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
> halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)
>
> For the SAP HANA benchmarks (where idle_spin is a parameter
> of the previous version of the patch, results should be the
> same):
>
> hpns == halt_poll_ns
>
>                            idle_spin=0/  idle_spin=800/  idle_spin=0/
>                            hpns=200000   hpns=0          hpns=800000
> DeleteC06T03 (100 thread)  1.76          1.71 (-3%)      1.78 (+1%)
> InsertC16T02 (100 thread)  2.14          2.07 (-3%)      2.18 (+1.8%)
> DeleteC00T01 (1 thread)    1.34          1.28 (-4.5%)    1.29 (-3.7%)
> UpdateC00T03 (1 thread)    4.72          4.18 (-12%)     4.53 (-5%)

Hi Marcelo,

some quick observations:

1) This is actually not KVM-specific, so the name and placement of the
   docs should be adjusted.

2) Regarding KVM-specific code, however, we could add an MSR so that
   KVM disables halt_poll_ns for this VM when this is active in the
   guest?

3) The spin time could use the same adaptive algorithm that KVM uses
   in the host.

Thanks,

Paolo

> [...]
On 20.05.19 13:51, Paolo Bonzini wrote:
> On 17/05/19 19:48, Marcelo Tosatti wrote:
>> [...]
>
> Hi Marcelo,
>
> some quick observations:
>
> 1) This is actually not KVM-specific, so the name and placement of the
>    docs should be adjusted.
>
> 2) Regarding KVM-specific code, however, we could add an MSR so that
>    KVM disables halt_poll_ns for this VM when this is active in the
>    guest?

The whole code looks pretty much architecture independent. I have also
seen cases on s390 where this kind of code would make sense. Can we try
to make this usable for other archs as well?

> 3) The spin time could use the same adaptive algorithm that KVM uses
>    in the host.
>
> Thanks,
>
> Paolo
>
> [...]
On Mon, May 20, 2019 at 01:51:57PM +0200, Paolo Bonzini wrote:
> On 17/05/19 19:48, Marcelo Tosatti wrote:
> > [...]
>
> Hi Marcelo,
>
> some quick observations:
>
> 1) This is actually not KVM-specific, so the name and placement of the
>    docs should be adjusted.

Agreed. Will call it: cpuidle_halt_poll, move it to drivers/cpuidle/

> 2) Regarding KVM-specific code, however, we could add an MSR so that
>    KVM disables halt_poll_ns for this VM when this is active in the
>    guest?

Sure.

> 3) The spin time could use the same adaptive algorithm that KVM uses
>    in the host.

Agreed... This can be done later, I suppose (the current fixed setting
works sufficiently well for our needs).

> Thanks,
>
> Paolo
>
> [...]
On Mon, May 20, 2019 at 02:07:09PM +0200, Christian Borntraeger wrote:
> [...]
>
> The whole code looks pretty much architecture independent. I have also
> seen cases on s390 where this kind of code would make sense. Can we try
> to make this usable for other archs as well?

Will move to drivers/cpuidle/
On 20.05.19 14:07, Christian Borntraeger wrote:
> [...]
>
> The whole code looks pretty much architecture independent. I have also
> seen cases on s390 where this kind of code would make sense. Can we try
> to make this usable for other archs as well?

I did a quick hack (not yet for the list as it contains some ugliness),
and the code seems to run ok on s390. So any chance to move this into
drivers/cpuidle/ so that !x86 can also enable that when appropriate?

I actually agree with Paolo that we should disable host halt polling as
soon as the guest does it. Maybe we should have some arch-specific
callback (that can be an MSR).
On Mon, May 20, 2019 at 01:51:57PM +0200, Paolo Bonzini wrote:
> [...]
>
> 3) The spin time could use the same adaptive algorithm that KVM uses
>    in the host.

Hi Paolo,

Consider a sequence of wakeup events as follows:

20us, 200us, 20us, 200us...

1) halt_poll_ns=250us  v->halt_poll_ns=0us    wakeup=20us
   grow sets v->halt_poll_ns = 20us
2) halt_poll_ns=250us  v->halt_poll_ns=20us   wakeup=200us
   grow sets v->halt_poll_ns = 40us
3) halt_poll_ns=250us  v->halt_poll_ns=40us   wakeup=20us
   v->halt_poll_ns untouched

Doubling repeats until v->halt_poll_ns = 80, 160, 250us.

N) halt_poll_ns=250us  v->halt_poll_ns=250us  wakeup=20us

If in the middle of the 20us, 200us, 20us... sequence you block for a
value larger than halt_poll_ns (250us in this case), the logic today
will either:

1) set v->halt_poll_ns to zero, or
2) set v->halt_poll_ns to 125us (if you set shrink to 2).

In either case, you lose (one missed event any time block_time >
halt_poll_ns). If one enables guest halt polling in the first place,
then the energy/performance tradeoff is bent towards performance, and
such misses are harmful.

So going to add something along the lines of:

"If, after 50 consecutive times, block_time is much larger than
halt_poll_ns, then set cpu->halt_poll_ns to zero."

Restore the user halt_poll_ns value once one smaller block_time is
observed.

This should cover the full idle case, and cause minimal harm to
performance.

Is that OK, or is there any other characteristic of adaptive halt
polling you are looking for?
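To make the tradeoff concrete, here is a minimal standalone model of the
grow-on-miss behavior described above, extended with the proposed
"consecutive long blocks" guard. The constants (the 250us cap, grow
factor 2, the 50-block limit) and all names are illustrative
assumptions, not KVM's actual implementation:

/*
 * Standalone model of adaptive halt polling with the proposed guard:
 * only disable polling after many consecutive blocks far beyond the
 * cap, so alternating 20us/200us wakeups keep their grown window.
 * All constants and names are illustrative, not the real KVM code.
 */
#include <stdio.h>

#define HALT_POLL_NS_MAX	250000ULL /* user-set halt_poll_ns cap */
#define HALT_POLL_GROW		2         /* doubling factor           */
#define LONG_BLOCK_LIMIT	50        /* misses before giving up   */

struct vcpu_poll {
	unsigned long long halt_poll_ns; /* per-vCPU poll window    */
	unsigned int long_blocks;        /* consecutive long blocks */
};

static void adjust(struct vcpu_poll *v, unsigned long long block_ns)
{
	if (block_ns <= v->halt_poll_ns) {
		/* Wakeup arrived inside the window: polling paid off. */
		v->long_blocks = 0;
		return;
	}

	if (block_ns <= HALT_POLL_NS_MAX) {
		/* A larger window would have caught this wakeup: grow. */
		v->long_blocks = 0;
		v->halt_poll_ns = v->halt_poll_ns ?
			v->halt_poll_ns * HALT_POLL_GROW : block_ns;
		if (v->halt_poll_ns > HALT_POLL_NS_MAX)
			v->halt_poll_ns = HALT_POLL_NS_MAX;
		return;
	}

	/*
	 * Block far beyond the cap: instead of shrinking right away
	 * (which loses one event per long block in the 20/200us case),
	 * zero the window only after LONG_BLOCK_LIMIT misses in a row.
	 */
	if (++v->long_blocks >= LONG_BLOCK_LIMIT)
		v->halt_poll_ns = 0;
}

int main(void)
{
	struct vcpu_poll v = { 0, 0 };
	/* The alternating trace from the email grows to 40/80/160/250us. */
	unsigned long long trace[] = { 20000, 200000, 20000, 200000,
				       20000, 200000, 20000, 200000 };
	unsigned int i;

	for (i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		adjust(&v, trace[i]);
		printf("block=%6lluns -> window=%llu ns\n",
		       trace[i], v.halt_poll_ns);
	}
	return 0;
}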
On Mon, May 20, 2019 at 03:46:50PM +0200, Christian Borntraeger wrote:
> [...]
>
> I did a quick hack (not yet for the list as it contains some ugliness),
> and the code seems to run ok on s390. So any chance to move this into
> drivers/cpuidle/ so that !x86 can also enable that when appropriate?

Done that, but you have to provide a default_idle function, and then
later change:

+config HALTPOLL_CPUIDLE
+	tristate "Halt poll cpuidle driver"
+	depends on X86
+	default y
+	help

> I actually agree with Paolo that we should disable host halt polling as
> soon as the guest does it. Maybe we should have some arch-specific
> callback (that can be an MSR).

Yep.
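For the arch-specific callback idea, a sketch of what the x86 side could
look like: the guest flips a paravirtual MSR so the host stops HLT
polling while guest-side polling is active. The hook names, the MSR
index, and the single-CPU handling below are assumptions for
illustration (the index mirrors what later became KVM's poll-control
interface), not part of this patch:

// SPDX-License-Identifier: GPL-2.0
/*
 * Sketch of a per-arch haltpoll hook (names hypothetical).  On x86
 * the guest writes a KVM paravirtual MSR to ask the host to stop
 * polling on HLT while the guest polls on its own.
 */
#include <asm/msr.h>
#include <asm/kvm_para.h>

#define MSR_KVM_POLL_CONTROL	0x4b564d05	/* assumed index */

void arch_haltpoll_enable(void)
{
	/*
	 * Bit 0 clear = host should not poll on HLT.  A real
	 * implementation would run this on every vCPU, not just
	 * the current one.
	 */
	if (kvm_para_available())
		wrmsrl(MSR_KVM_POLL_CONTROL, 0);
}

void arch_haltpoll_disable(void)
{
	/* Restore the default: host-side HLT polling allowed again. */
	if (kvm_para_available())
		wrmsrl(MSR_KVM_POLL_CONTROL, 1);
}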
On 22/05/19 17:04, Marcelo Tosatti wrote:
> Consider a sequence of wakeup events as follows:
>
> 20us, 200us, 20us, 200us...

I agree it can happen, which is why the grow/shrink behavior can be
disabled for halt_poll_ns. Is there a real-world usecase with a
sequence like this?

The main qualm I have with guest-side polling is that it encourages the
guest admin to be "impolite". But I guess it was possible even now to
boot guests with idle=poll, which would be way more impolite...

Paolo

> If one enables guest halt polling in the first place,
> then the energy/performance tradeoff is bent towards performance, and
> such misses are harmful.
>
> So going to add something along the lines of:
>
> "If, after 50 consecutive times, block_time is much larger than
> halt_poll_ns, then set cpu->halt_poll_ns to zero."
>
> Restore the user halt_poll_ns value once one smaller block_time is
> observed.
>
> This should cover the full idle case, and cause minimal harm to
> performance.
>
> Is that OK, or is there any other characteristic of adaptive halt
> polling you are looking for?
On Wed, May 22, 2019 at 05:44:34PM +0200, Paolo Bonzini wrote:
> On 22/05/19 17:04, Marcelo Tosatti wrote:
>> Consider a sequence of wakeup events as follows:
>>
>> 20us, 200us, 20us, 200us...
>
> I agree it can happen, which is why the grow/shrink behavior can be
> disabled for halt_poll_ns. Is there a real-world usecase with a
> sequence like this?

If you have a database with variable response times in the
20, 200, 20, 200... range, then yes. It's not a bizarre/unlikely
sequence.

You didn't answer my question at the end of the email.

> The main qualm I have with guest-side polling is that it encourages the
> guest admin to be "impolite". But I guess it was possible even now to
> boot guests with idle=poll, which would be way more impolite...

Yep. Thanks.
Index: linux-2.6.git/arch/x86/Kconfig
===================================================================
--- linux-2.6.git.orig/arch/x86/Kconfig	2019-04-22 13:49:42.858303265 -0300
+++ linux-2.6.git/arch/x86/Kconfig	2019-05-16 14:18:41.254852745 -0300
@@ -805,6 +805,15 @@
 	  underlying device model, the host provides the guest with
 	  timing infrastructure such as time of day, and system time

+config KVM_CPUIDLE
+	tristate "KVM cpuidle driver"
+	depends on KVM_GUEST
+	default y
+	help
+	  This option enables the KVM cpuidle driver, which allows the guest
+	  to poll before halting (more efficient than polling in the host
+	  via halt_poll_ns for some scenarios).
+
 config PVH
 	bool "Support for running PVH guests"
 	---help---
Index: linux-2.6.git/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/Makefile	2019-04-22 13:49:42.869303331 -0300
+++ linux-2.6.git/arch/x86/kernel/Makefile	2019-05-17 12:59:51.673274881 -0300
@@ -112,6 +112,7 @@
 obj-$(CONFIG_DEBUG_NMI_SELFTEST)	+= nmi_selftest.o

 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
+obj-$(CONFIG_KVM_CPUIDLE)	+= cpuidle_kvm.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
Index: linux-2.6.git/arch/x86/kernel/process.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/process.c	2019-04-22 13:49:42.876303374 -0300
+++ linux-2.6.git/arch/x86/kernel/process.c	2019-05-17 13:19:18.055435117 -0300
@@ -580,7 +580,7 @@
 	safe_halt();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 }
-#ifdef CONFIG_APM_MODULE
+#if defined(CONFIG_APM_MODULE) || defined(CONFIG_KVM_CPUIDLE_MODULE)
 EXPORT_SYMBOL(default_idle);
 #endif

Index: linux-2.6.git/arch/x86/kernel/cpuidle_kvm.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.git/arch/x86/kernel/cpuidle_kvm.c	2019-05-17 13:38:02.553941356 -0300
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cpuidle driver for KVM guests.
+ *
+ * Copyright 2019 Red Hat, Inc. and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Authors: Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/init.h>
+#include <linux/cpuidle.h>
+#include <linux/module.h>
+#include <linux/timekeeping.h>
+#include <linux/sched/idle.h>
+
+unsigned int guest_halt_poll_ns;
+module_param(guest_halt_poll_ns, uint, 0644);
+
+static int kvm_enter_idle(struct cpuidle_device *dev,
+			  struct cpuidle_driver *drv, int index)
+{
+	int do_halt = 0;
+
+	/* No polling */
+	if (guest_halt_poll_ns == 0) {
+		if (current_clr_polling_and_test()) {
+			local_irq_enable();
+			return index;
+		}
+		default_idle();
+		return index;
+	}
+
+	local_irq_enable();
+	if (!current_set_polling_and_test()) {
+		ktime_t now, end_spin;
+
+		now = ktime_get();
+		end_spin = ktime_add_ns(now, guest_halt_poll_ns);
+
+		while (!need_resched()) {
+			cpu_relax();
+			now = ktime_get();
+
+			if (!ktime_before(now, end_spin)) {
+				do_halt = 1;
+				break;
+			}
+		}
+	}
+
+	if (do_halt) {
+		/*
+		 * No events arrived while the busy spin window
+		 * passed: halt.
+		 */
+		local_irq_disable();
+		if (current_clr_polling_and_test()) {
+			local_irq_enable();
+			return index;
+		}
+		default_idle();
+	} else {
+		current_clr_polling();
+	}
+
+	return index;
+}
+
+static struct cpuidle_driver kvm_idle_driver = {
+	.name = "kvm_idle",
+	.owner = THIS_MODULE,
+	.states = {
+		{ /* entry 0 is for polling */ },
+		{
+			.enter = kvm_enter_idle,
+			.exit_latency = 0,
+			.target_residency = 0,
+			.power_usage = -1,
+			.name = "KVM",
+			.desc = "KVM idle",
+		},
+	},
+	.safe_state_index = 0,
+	.state_count = 2,
+};
+
+static int __init kvm_cpuidle_init(void)
+{
+	return cpuidle_register(&kvm_idle_driver, NULL);
+}
+
+static void __exit kvm_cpuidle_exit(void)
+{
+	cpuidle_unregister(&kvm_idle_driver);
+}
+
+module_init(kvm_cpuidle_init);
+module_exit(kvm_cpuidle_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Marcelo Tosatti <mtosatti@redhat.com>");
+
Index: linux-2.6.git/Documentation/virtual/kvm/guest-halt-polling.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.git/Documentation/virtual/kvm/guest-halt-polling.txt	2019-05-17 13:36:39.274703710 -0300
@@ -0,0 +1,39 @@
+KVM guest halt polling
+======================
+
+The cpuidle_kvm driver allows the guest vcpus to poll for a specified
+amount of time before halting. This provides the following benefits
+over host side polling:
+
+	1) The POLL flag is set while polling is performed, which allows
+	   a remote vCPU to avoid sending an IPI (and the associated
+	   cost of handling the IPI) when performing a wakeup.
+
+	2) The HLT VM-exit cost can be avoided.
+
+The downside of guest side polling is that polling is performed
+even with other runnable tasks in the host.
+
+Module Parameters
+=================
+
+The cpuidle_kvm module has one tunable module parameter:
+guest_halt_poll_ns, the amount of time, in nanoseconds, that polling is
+performed before halting.
+
+This module parameter can be set via the sysfs files in:
+
+	/sys/module/cpuidle_kvm/parameters/
+
+Further Notes
+=============
+
+- Care should be taken when setting the guest_halt_poll_ns parameter, as
+a large value has the potential to drive the cpu usage to 100% on a
+machine which would be almost entirely idle otherwise.
+
+- The effective amount of time that polling is performed is the host poll
+value (see halt-polling.txt) plus guest_halt_poll_ns. If all guests
+on a host system support and have properly configured guest_halt_poll_ns,
+then setting halt_poll_ns to 0 in the host is probably the best choice.
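As a usage illustration of the module parameter described in the patch's
documentation, a tiny userspace helper that sets the poll window at
runtime through the sysfs path above; a sketch that assumes the module
is loaded under the name cpuidle_kvm and that it runs as root, with the
300us value taken from the benchmarks in the cover letter:

/* Sketch: set guest_halt_poll_ns to 300us at runtime via sysfs. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/module/cpuidle_kvm/parameters/guest_halt_poll_ns";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);	/* module not loaded, or not root */
		return 1;
	}
	fprintf(f, "%u\n", 300000u);	/* 300us busy-poll before halt */
	return fclose(f) ? 1 : 0;
}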