diff mbox series

KVM: x86: disable halt polling when powersave governor is used

Message ID 20220915073121.1038840-1-dapeng1.mi@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86: disable halt polling when powersave governor is used | expand

Commit Message

Mi, Dapeng1 Sept. 15, 2022, 7:31 a.m. UTC
Halt polling is enabled by default even through the CPU frequency
governor is configured to powersave. Generally halt polling would
consume extra power and this's not identical with the intent of
powersave governor.

disabling halt polling in powersave governor can save the precious
power in power critical case.

FIO random read test on Alder Lake platform shows halt polling
occupies ~17% CPU utilization and consume 7% extra CPU power.
After disabling halt polling, CPU has more chance to enter deeper
C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%).

On Alder Lake platform, we don't find there are obvious performance
downgrade after disabling halt polling on FIO and Netperf cases.
Netperf UDP_RR case runs from two VMs locate on two different physical
machines.

FIO(MB/s)	Base	Disable-halt-polling	Delta%
Rand-read	432.6	436.3			0.8%

Netperf		Base	Disable-halt-polling	Delta%
UDP_RR          509.8	508.5			-0.3%

Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com>
---
 arch/x86/kvm/x86.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Comments

Sean Christopherson Oct. 7, 2022, 5:51 p.m. UTC | #1
On Thu, Sep 15, 2022, Dapeng Mi wrote:
> Halt polling is enabled by default even through the CPU frequency
> governor is configured to powersave. Generally halt polling would
> consume extra power and this's not identical with the intent of
> powersave governor.
> 
> disabling halt polling in powersave governor can save the precious
> power in power critical case.
> 
> FIO random read test on Alder Lake platform shows halt polling
> occupies ~17% CPU utilization and consume 7% extra CPU power.
> After disabling halt polling, CPU has more chance to enter deeper
> C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%).
> 
> On Alder Lake platform, we don't find there are obvious performance
> downgrade after disabling halt polling on FIO and Netperf cases.
> Netperf UDP_RR case runs from two VMs locate on two different physical
> machines.
> 
> FIO(MB/s)	Base	Disable-halt-polling	Delta%
> Rand-read	432.6	436.3			0.8%
> 
> Netperf		Base	Disable-halt-polling	Delta%
> UDP_RR          509.8	508.5			-0.3%
> 
> Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com>
> ---
>  arch/x86/kvm/x86.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7374d768296..c0eb6574cbbb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void)
>  
>  bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>  {
> -	return (vcpu->arch.msr_kvm_poll_control & 1) == 0;
> +	struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu);

Preemption is not disabled at this point, which means that using vcpu->cpu is
potentially unsafe.  Given that cpufreq is refcounting the returned object, I gotta
imaging get migrated to a different pCPU would be problematic.

> +	bool powersave = false;

I don't see anything in here that's x86 specific.  Unless I'm missing something,
this belongs in common KVM.

> +
> +	/*
> +	 * Halt polling could consume much CPU power, if CPU frequency
> +	 * governor is set to "powersave", disable halt polling.
> +	 */
> +	if (policy) {
> +		if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) ||
> +			(policy->governor &&

Indentation is messed up.

> +				!strncmp(policy->governor->name, "powersave",

KVM should not be comparing magic strings.  If the cpufreq subsystem can't get
policy->policy right, then that needs to be fixed.

> +					CPUFREQ_NAME_LEN)))
> +			powersave = true;
> +		cpufreq_cpu_put(policy);
> +	}
> +	return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave;

Doing all of the above work if polling is disabled is silly.

>  }
>  EXPORT_SYMBOL_GPL(kvm_arch_no_poll);

All in all, _if_ we want to do this automatically and not let userspace decide how
to manage powersave vs. halt-poll, I think this should be more like:

---
 virt/kvm/kvm_main.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e30f1b4ecfa5..01116859cb31 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -29,6 +29,7 @@
 #include <linux/file.h>
 #include <linux/syscore_ops.h>
 #include <linux/cpu.h>
+#include <linux/cpufreq.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/stat.h>
@@ -3483,6 +3484,23 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
 	}
 }
 
+static bool kvm_cpufreq_no_halt_poll(struct kvm_vcpu *vcpu)
+{
+	struct cpufreq_policy *policy;
+	bool powersave = false;
+
+	preempt_disable();
+
+	policy = cpufreq_cpu_get(vcpu->cpu);
+	if (policy) {
+		powersave = (policy->policy == CPUFREQ_POLICY_POWERSAVE);
+		cpufreq_cpu_put(policy);
+	}
+
+	preempt_enable();
+	return powersave;
+}
+
 /*
  * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc...  If halt
  * polling is enabled, busy wait for a short time before blocking to avoid the
@@ -3491,7 +3509,8 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
  */
 void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
 {
-	bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
+	const bool halt_poll_allowed = !kvm_arch_no_poll(vcpu) &&
+				       !kvm_cpufreq_no_halt_poll(vcpu);
 	bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
 	ktime_t start, cur, poll_end;
 	bool waited = false;

base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354
--
Mi, Dapeng1 Oct. 8, 2022, 9:40 a.m. UTC | #2
> From: Sean Christopherson <seanjc@google.com>
> Sent: Saturday, October 8, 2022 1:52 AM
> To: Mi, Dapeng1 <dapeng1.mi@intel.com>
> Cc: pbonzini@redhat.com; tglx@linutronix.de; mingo@redhat.com;
> dave.hansen@linux.intel.com; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; zhenyuw@linux.intel.com
> Subject: Re: [PATCH] KVM: x86: disable halt polling when powersave governor is
> used
> 
> On Thu, Sep 15, 2022, Dapeng Mi wrote:
> > Halt polling is enabled by default even through the CPU frequency
> > governor is configured to powersave. Generally halt polling would
> > consume extra power and this's not identical with the intent of
> > powersave governor.
> >
> > disabling halt polling in powersave governor can save the precious
> > power in power critical case.
> >
> > FIO random read test on Alder Lake platform shows halt polling
> > occupies ~17% CPU utilization and consume 7% extra CPU power.
> > After disabling halt polling, CPU has more chance to enter deeper
> > C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%).
> >
> > On Alder Lake platform, we don't find there are obvious performance
> > downgrade after disabling halt polling on FIO and Netperf cases.
> > Netperf UDP_RR case runs from two VMs locate on two different physical
> > machines.
> >
> > FIO(MB/s)	Base	Disable-halt-polling	Delta%
> > Rand-read	432.6	436.3			0.8%
> >
> > Netperf		Base	Disable-halt-polling	Delta%
> > UDP_RR          509.8	508.5			-0.3%
> >
> > Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com>
> > ---
> >  arch/x86/kvm/x86.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index
> > d7374d768296..c0eb6574cbbb 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void)
> >
> >  bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)  {
> > -	return (vcpu->arch.msr_kvm_poll_control & 1) == 0;
> > +	struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu);
> 
> Preemption is not disabled at this point, which means that using vcpu->cpu is
> potentially unsafe.  Given that cpufreq is refcounting the returned object, I gotta
> imaging get migrated to a different pCPU would be problematic.

Thanks for pointing this. Per my learning, even vCPU migrates to a different pCPU in the
progress of getting the cpufreq policy, the only consequence is to get an outdated (maybe 
inaccurate) policy value. Halt polling mechanism still can get the updated and correct cpufreq
policy in next time. And even we disable preemption in the process of obtaining cpufreq policy,
the vCPU is still possible to be migrated a different pCPU after enabling preemption and before
calling the halt polling judging logic.

> 
> > +	bool powersave = false;
> 
> I don't see anything in here that's x86 specific.  Unless I'm missing something,
> this belongs in common KVM.
> 

Yes, this is generic. 

> > +
> > +	/*
> > +	 * Halt polling could consume much CPU power, if CPU frequency
> > +	 * governor is set to "powersave", disable halt polling.
> > +	 */
> > +	if (policy) {
> > +		if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) ||
> > +			(policy->governor &&
> 
> Indentation is messed up.

Sure. Would change.

> 
> > +				!strncmp(policy->governor->name,
> "powersave",
> 
> KVM should not be comparing magic strings.  If the cpufreq subsystem can't get
> policy->policy right, then that needs to be fixed.

Yeah, using magic strings looks a little bit strange, but this is what is cpufreq doing.
Currently cpufreq mechanism supports two kinds of drivers, one is the driver which has
the built-in governor, like intel_pstate driver. For this kind of driver, the cpufreq governor
is saved in the policy->policy field. The other is the traditional driver which is independent
with cpufreq governor and the cpufreq governor type is saved in the governor->name field.
For the second kind of cpufreq driver, the policy->policy field is meaningless and we have to
read the governor name. 

> 
> > +					CPUFREQ_NAME_LEN)))
> > +			powersave = true;
> > +		cpufreq_cpu_put(policy);
> > +	}
> > +	return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave;
> 
> Doing all of the above work if polling is disabled is silly.

Correct. Would change. 

> 
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
> 
> All in all, _if_ we want to do this automatically and not let userspace decide how
> to manage powersave vs. halt-poll, I think this should be more like:

Thanks for your sample. Would change in V2 patch. 

> 
> ---
>  virt/kvm/kvm_main.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index
> e30f1b4ecfa5..01116859cb31 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -29,6 +29,7 @@
>  #include <linux/file.h>
>  #include <linux/syscore_ops.h>
>  #include <linux/cpu.h>
> +#include <linux/cpufreq.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/stat.h>
> @@ -3483,6 +3484,23 @@ static inline void update_halt_poll_stats(struct
> kvm_vcpu *vcpu, ktime_t start,
>  	}
>  }
> 
> +static bool kvm_cpufreq_no_halt_poll(struct kvm_vcpu *vcpu) {
> +	struct cpufreq_policy *policy;
> +	bool powersave = false;
> +
> +	preempt_disable();
> +
> +	policy = cpufreq_cpu_get(vcpu->cpu);
> +	if (policy) {
> +		powersave = (policy->policy == CPUFREQ_POLICY_POWERSAVE);
> +		cpufreq_cpu_put(policy);
> +	}
> +
> +	preempt_enable();
> +	return powersave;
> +}
> +
>  /*
>   * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc...  If halt
>   * polling is enabled, busy wait for a short time before blocking to avoid the
> @@ -3491,7 +3509,8 @@ static inline void update_halt_poll_stats(struct
> kvm_vcpu *vcpu, ktime_t start,
>   */
>  void kvm_vcpu_halt(struct kvm_vcpu *vcpu)  {
> -	bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
> +	const bool halt_poll_allowed = !kvm_arch_no_poll(vcpu) &&
> +				       !kvm_cpufreq_no_halt_poll(vcpu);
>  	bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
>  	ktime_t start, cur, poll_end;
>  	bool waited = false;
> 
> base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354
> --
Sean Christopherson Oct. 25, 2022, 5:06 p.m. UTC | #3
On Sat, Oct 08, 2022, Mi, Dapeng1 wrote:
> > > +				!strncmp(policy->governor->name,
> > "powersave",
> > 
> > KVM should not be comparing magic strings.  If the cpufreq subsystem can't get
> > policy->policy right, then that needs to be fixed.
> 
> Yeah, using magic strings looks a little bit strange, but this is what is
> cpufreq doing.  Currently cpufreq mechanism supports two kinds of drivers,
> one is the driver which has the built-in governor, like intel_pstate driver.
> For this kind of driver, the cpufreq governor is saved in the policy->policy
> field. The other is the traditional driver which is independent with cpufreq
> governor and the cpufreq governor type is saved in the governor->name field.
> For the second kind of cpufreq driver, the policy->policy field is
> meaningless and we have to read the governor name. 

That doesn't mean it's ok to bleed those internal details into KVM.  I would much
rather cpufreq provide a helper to get the effective policy, e.g.

  unsigned int cpufreq_cpu_get_policy(unsigned int cpu)
  {
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	unsigned int pol;

	if (!policy)
		return CPUFREQ_POLICY_UNKNOWN;

	pol = policy->policy
	if (pol == CPUFREQ_POLICY_UNKNOWN && policy->governor)
		pol = cpufreq_parse_policy(policy->governor->name);

	cpufreq_cpu_put(policy);
  }
Mi, Dapeng1 Nov. 15, 2022, 6:25 a.m. UTC | #4
> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, October 26, 2022 1:07 AM
> To: Mi, Dapeng1 <dapeng1.mi@intel.com>
> Cc: pbonzini@redhat.com; tglx@linutronix.de; mingo@redhat.com;
> dave.hansen@linux.intel.com; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; zhenyuw@linux.intel.com
> Subject: Re: [PATCH] KVM: x86: disable halt polling when powersave
> governor is used
> 
> On Sat, Oct 08, 2022, Mi, Dapeng1 wrote:
> > > > +				!strncmp(policy->governor->name,
> > > "powersave",
> > >
> > > KVM should not be comparing magic strings.  If the cpufreq subsystem
> > > can't get
> > > policy->policy right, then that needs to be fixed.
> >
> > Yeah, using magic strings looks a little bit strange, but this is what
> > is cpufreq doing.  Currently cpufreq mechanism supports two kinds of
> > drivers, one is the driver which has the built-in governor, like intel_pstate
> driver.
> > For this kind of driver, the cpufreq governor is saved in the
> > policy->policy field. The other is the traditional driver which is
> > independent with cpufreq governor and the cpufreq governor type is
> saved in the governor->name field.
> > For the second kind of cpufreq driver, the policy->policy field is
> > meaningless and we have to read the governor name.
> 
> That doesn't mean it's ok to bleed those internal details into KVM.  I would
> much rather cpufreq provide a helper to get the effective policy, e.g.
> 
>   unsigned int cpufreq_cpu_get_policy(unsigned int cpu)
>   {
> 	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> 	unsigned int pol;
> 
> 	if (!policy)
> 		return CPUFREQ_POLICY_UNKNOWN;
> 
> 	pol = policy->policy
> 	if (pol == CPUFREQ_POLICY_UNKNOWN && policy->governor)
> 		pol = cpufreq_parse_policy(policy->governor->name);
> 
> 	cpufreq_cpu_put(policy);
>   }

Thanks Sean for reviewing. Would do in next version.
diff mbox series

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7374d768296..c0eb6574cbbb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13015,7 +13015,22 @@  bool kvm_vector_hashing_enabled(void)
 
 bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
 {
-	return (vcpu->arch.msr_kvm_poll_control & 1) == 0;
+	struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu);
+	bool powersave = false;
+
+	/*
+	 * Halt polling could consume much CPU power, if CPU frequency
+	 * governor is set to "powersave", disable halt polling.
+	 */
+	if (policy) {
+		if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) ||
+			(policy->governor &&
+				!strncmp(policy->governor->name, "powersave",
+					CPUFREQ_NAME_LEN)))
+			powersave = true;
+		cpufreq_cpu_put(policy);
+	}
+	return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave;
 }
 EXPORT_SYMBOL_GPL(kvm_arch_no_poll);