Message ID | 20220915073121.1038840-1-dapeng1.mi@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: disable halt polling when powersave governor is used | expand |
On Thu, Sep 15, 2022, Dapeng Mi wrote: > Halt polling is enabled by default even through the CPU frequency > governor is configured to powersave. Generally halt polling would > consume extra power and this's not identical with the intent of > powersave governor. > > disabling halt polling in powersave governor can save the precious > power in power critical case. > > FIO random read test on Alder Lake platform shows halt polling > occupies ~17% CPU utilization and consume 7% extra CPU power. > After disabling halt polling, CPU has more chance to enter deeper > C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%). > > On Alder Lake platform, we don't find there are obvious performance > downgrade after disabling halt polling on FIO and Netperf cases. > Netperf UDP_RR case runs from two VMs locate on two different physical > machines. > > FIO(MB/s) Base Disable-halt-polling Delta% > Rand-read 432.6 436.3 0.8% > > Netperf Base Disable-halt-polling Delta% > UDP_RR 509.8 508.5 -0.3% > > Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com> > --- > arch/x86/kvm/x86.c | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d7374d768296..c0eb6574cbbb 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void) > > bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) > { > - return (vcpu->arch.msr_kvm_poll_control & 1) == 0; > + struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu); Preemption is not disabled at this point, which means that using vcpu->cpu is potentially unsafe. Given that cpufreq is refcounting the returned object, I gotta imaging get migrated to a different pCPU would be problematic. > + bool powersave = false; I don't see anything in here that's x86 specific. Unless I'm missing something, this belongs in common KVM. > + > + /* > + * Halt polling could consume much CPU power, if CPU frequency > + * governor is set to "powersave", disable halt polling. > + */ > + if (policy) { > + if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) || > + (policy->governor && Indentation is messed up. > + !strncmp(policy->governor->name, "powersave", KVM should not be comparing magic strings. If the cpufreq subsystem can't get policy->policy right, then that needs to be fixed. > + CPUFREQ_NAME_LEN))) > + powersave = true; > + cpufreq_cpu_put(policy); > + } > + return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave; Doing all of the above work if polling is disabled is silly. > } > EXPORT_SYMBOL_GPL(kvm_arch_no_poll); All in all, _if_ we want to do this automatically and not let userspace decide how to manage powersave vs. halt-poll, I think this should be more like: --- virt/kvm/kvm_main.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e30f1b4ecfa5..01116859cb31 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -29,6 +29,7 @@ #include <linux/file.h> #include <linux/syscore_ops.h> #include <linux/cpu.h> +#include <linux/cpufreq.h> #include <linux/sched/signal.h> #include <linux/sched/mm.h> #include <linux/sched/stat.h> @@ -3483,6 +3484,23 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start, } } +static bool kvm_cpufreq_no_halt_poll(struct kvm_vcpu *vcpu) +{ + struct cpufreq_policy *policy; + bool powersave = false; + + preempt_disable(); + + policy = cpufreq_cpu_get(vcpu->cpu); + if (policy) { + powersave = (policy->policy == CPUFREQ_POLICY_POWERSAVE); + cpufreq_cpu_put(policy); + } + + preempt_enable(); + return powersave; +} + /* * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc... If halt * polling is enabled, busy wait for a short time before blocking to avoid the @@ -3491,7 +3509,8 @@ static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start, */ void kvm_vcpu_halt(struct kvm_vcpu *vcpu) { - bool halt_poll_allowed = !kvm_arch_no_poll(vcpu); + const bool halt_poll_allowed = !kvm_arch_no_poll(vcpu) && + !kvm_cpufreq_no_halt_poll(vcpu); bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns; ktime_t start, cur, poll_end; bool waited = false; base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354 --
> From: Sean Christopherson <seanjc@google.com> > Sent: Saturday, October 8, 2022 1:52 AM > To: Mi, Dapeng1 <dapeng1.mi@intel.com> > Cc: pbonzini@redhat.com; tglx@linutronix.de; mingo@redhat.com; > dave.hansen@linux.intel.com; kvm@vger.kernel.org; linux- > kernel@vger.kernel.org; zhenyuw@linux.intel.com > Subject: Re: [PATCH] KVM: x86: disable halt polling when powersave governor is > used > > On Thu, Sep 15, 2022, Dapeng Mi wrote: > > Halt polling is enabled by default even through the CPU frequency > > governor is configured to powersave. Generally halt polling would > > consume extra power and this's not identical with the intent of > > powersave governor. > > > > disabling halt polling in powersave governor can save the precious > > power in power critical case. > > > > FIO random read test on Alder Lake platform shows halt polling > > occupies ~17% CPU utilization and consume 7% extra CPU power. > > After disabling halt polling, CPU has more chance to enter deeper > > C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%). > > > > On Alder Lake platform, we don't find there are obvious performance > > downgrade after disabling halt polling on FIO and Netperf cases. > > Netperf UDP_RR case runs from two VMs locate on two different physical > > machines. > > > > FIO(MB/s) Base Disable-halt-polling Delta% > > Rand-read 432.6 436.3 0.8% > > > > Netperf Base Disable-halt-polling Delta% > > UDP_RR 509.8 508.5 -0.3% > > > > Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com> > > --- > > arch/x86/kvm/x86.c | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index > > d7374d768296..c0eb6574cbbb 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void) > > > > bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) { > > - return (vcpu->arch.msr_kvm_poll_control & 1) == 0; > > + struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu); > > Preemption is not disabled at this point, which means that using vcpu->cpu is > potentially unsafe. Given that cpufreq is refcounting the returned object, I gotta > imaging get migrated to a different pCPU would be problematic. Thanks for pointing this. Per my learning, even vCPU migrates to a different pCPU in the progress of getting the cpufreq policy, the only consequence is to get an outdated (maybe inaccurate) policy value. Halt polling mechanism still can get the updated and correct cpufreq policy in next time. And even we disable preemption in the process of obtaining cpufreq policy, the vCPU is still possible to be migrated a different pCPU after enabling preemption and before calling the halt polling judging logic. > > > + bool powersave = false; > > I don't see anything in here that's x86 specific. Unless I'm missing something, > this belongs in common KVM. > Yes, this is generic. > > + > > + /* > > + * Halt polling could consume much CPU power, if CPU frequency > > + * governor is set to "powersave", disable halt polling. > > + */ > > + if (policy) { > > + if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) || > > + (policy->governor && > > Indentation is messed up. Sure. Would change. > > > + !strncmp(policy->governor->name, > "powersave", > > KVM should not be comparing magic strings. If the cpufreq subsystem can't get > policy->policy right, then that needs to be fixed. Yeah, using magic strings looks a little bit strange, but this is what is cpufreq doing. Currently cpufreq mechanism supports two kinds of drivers, one is the driver which has the built-in governor, like intel_pstate driver. For this kind of driver, the cpufreq governor is saved in the policy->policy field. The other is the traditional driver which is independent with cpufreq governor and the cpufreq governor type is saved in the governor->name field. For the second kind of cpufreq driver, the policy->policy field is meaningless and we have to read the governor name. > > > + CPUFREQ_NAME_LEN))) > > + powersave = true; > > + cpufreq_cpu_put(policy); > > + } > > + return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave; > > Doing all of the above work if polling is disabled is silly. Correct. Would change. > > > } > > EXPORT_SYMBOL_GPL(kvm_arch_no_poll); > > All in all, _if_ we want to do this automatically and not let userspace decide how > to manage powersave vs. halt-poll, I think this should be more like: Thanks for your sample. Would change in V2 patch. > > --- > virt/kvm/kvm_main.c | 21 ++++++++++++++++++++- > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > e30f1b4ecfa5..01116859cb31 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -29,6 +29,7 @@ > #include <linux/file.h> > #include <linux/syscore_ops.h> > #include <linux/cpu.h> > +#include <linux/cpufreq.h> > #include <linux/sched/signal.h> > #include <linux/sched/mm.h> > #include <linux/sched/stat.h> > @@ -3483,6 +3484,23 @@ static inline void update_halt_poll_stats(struct > kvm_vcpu *vcpu, ktime_t start, > } > } > > +static bool kvm_cpufreq_no_halt_poll(struct kvm_vcpu *vcpu) { > + struct cpufreq_policy *policy; > + bool powersave = false; > + > + preempt_disable(); > + > + policy = cpufreq_cpu_get(vcpu->cpu); > + if (policy) { > + powersave = (policy->policy == CPUFREQ_POLICY_POWERSAVE); > + cpufreq_cpu_put(policy); > + } > + > + preempt_enable(); > + return powersave; > +} > + > /* > * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc... If halt > * polling is enabled, busy wait for a short time before blocking to avoid the > @@ -3491,7 +3509,8 @@ static inline void update_halt_poll_stats(struct > kvm_vcpu *vcpu, ktime_t start, > */ > void kvm_vcpu_halt(struct kvm_vcpu *vcpu) { > - bool halt_poll_allowed = !kvm_arch_no_poll(vcpu); > + const bool halt_poll_allowed = !kvm_arch_no_poll(vcpu) && > + !kvm_cpufreq_no_halt_poll(vcpu); > bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns; > ktime_t start, cur, poll_end; > bool waited = false; > > base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354 > --
On Sat, Oct 08, 2022, Mi, Dapeng1 wrote: > > > + !strncmp(policy->governor->name, > > "powersave", > > > > KVM should not be comparing magic strings. If the cpufreq subsystem can't get > > policy->policy right, then that needs to be fixed. > > Yeah, using magic strings looks a little bit strange, but this is what is > cpufreq doing. Currently cpufreq mechanism supports two kinds of drivers, > one is the driver which has the built-in governor, like intel_pstate driver. > For this kind of driver, the cpufreq governor is saved in the policy->policy > field. The other is the traditional driver which is independent with cpufreq > governor and the cpufreq governor type is saved in the governor->name field. > For the second kind of cpufreq driver, the policy->policy field is > meaningless and we have to read the governor name. That doesn't mean it's ok to bleed those internal details into KVM. I would much rather cpufreq provide a helper to get the effective policy, e.g. unsigned int cpufreq_cpu_get_policy(unsigned int cpu) { struct cpufreq_policy *policy = cpufreq_cpu_get(cpu); unsigned int pol; if (!policy) return CPUFREQ_POLICY_UNKNOWN; pol = policy->policy if (pol == CPUFREQ_POLICY_UNKNOWN && policy->governor) pol = cpufreq_parse_policy(policy->governor->name); cpufreq_cpu_put(policy); }
> From: Sean Christopherson <seanjc@google.com> > Sent: Wednesday, October 26, 2022 1:07 AM > To: Mi, Dapeng1 <dapeng1.mi@intel.com> > Cc: pbonzini@redhat.com; tglx@linutronix.de; mingo@redhat.com; > dave.hansen@linux.intel.com; kvm@vger.kernel.org; linux- > kernel@vger.kernel.org; zhenyuw@linux.intel.com > Subject: Re: [PATCH] KVM: x86: disable halt polling when powersave > governor is used > > On Sat, Oct 08, 2022, Mi, Dapeng1 wrote: > > > > + !strncmp(policy->governor->name, > > > "powersave", > > > > > > KVM should not be comparing magic strings. If the cpufreq subsystem > > > can't get > > > policy->policy right, then that needs to be fixed. > > > > Yeah, using magic strings looks a little bit strange, but this is what > > is cpufreq doing. Currently cpufreq mechanism supports two kinds of > > drivers, one is the driver which has the built-in governor, like intel_pstate > driver. > > For this kind of driver, the cpufreq governor is saved in the > > policy->policy field. The other is the traditional driver which is > > independent with cpufreq governor and the cpufreq governor type is > saved in the governor->name field. > > For the second kind of cpufreq driver, the policy->policy field is > > meaningless and we have to read the governor name. > > That doesn't mean it's ok to bleed those internal details into KVM. I would > much rather cpufreq provide a helper to get the effective policy, e.g. > > unsigned int cpufreq_cpu_get_policy(unsigned int cpu) > { > struct cpufreq_policy *policy = cpufreq_cpu_get(cpu); > unsigned int pol; > > if (!policy) > return CPUFREQ_POLICY_UNKNOWN; > > pol = policy->policy > if (pol == CPUFREQ_POLICY_UNKNOWN && policy->governor) > pol = cpufreq_parse_policy(policy->governor->name); > > cpufreq_cpu_put(policy); > } Thanks Sean for reviewing. Would do in next version.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d7374d768296..c0eb6574cbbb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void) bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) { - return (vcpu->arch.msr_kvm_poll_control & 1) == 0; + struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu); + bool powersave = false; + + /* + * Halt polling could consume much CPU power, if CPU frequency + * governor is set to "powersave", disable halt polling. + */ + if (policy) { + if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) || + (policy->governor && + !strncmp(policy->governor->name, "powersave", + CPUFREQ_NAME_LEN))) + powersave = true; + cpufreq_cpu_put(policy); + } + return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave; } EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
Halt polling is enabled by default even through the CPU frequency governor is configured to powersave. Generally halt polling would consume extra power and this's not identical with the intent of powersave governor. disabling halt polling in powersave governor can save the precious power in power critical case. FIO random read test on Alder Lake platform shows halt polling occupies ~17% CPU utilization and consume 7% extra CPU power. After disabling halt polling, CPU has more chance to enter deeper C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%). On Alder Lake platform, we don't find there are obvious performance downgrade after disabling halt polling on FIO and Netperf cases. Netperf UDP_RR case runs from two VMs locate on two different physical machines. FIO(MB/s) Base Disable-halt-polling Delta% Rand-read 432.6 436.3 0.8% Netperf Base Disable-halt-polling Delta% UDP_RR 509.8 508.5 -0.3% Signed-off-by: Dapeng Mi <dapeng1.mi@intel.com> --- arch/x86/kvm/x86.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)