diff mbox series

KVM: VMX: Nop emulation of MSR_IA32_POWER_CTL

Message ID 20190415154526.64709-1-liran.alon@oracle.com (mailing list archive)
State New, archived
Headers show
Series KVM: VMX: Nop emulation of MSR_IA32_POWER_CTL | expand

Commit Message

Liran Alon April 15, 2019, 3:45 p.m. UTC
Since commits 668fffa3f838 ("kvm: better MWAIT emulation for guests”)
and 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts”),
KVM was modified to allow an admin to configure certain guests to execute
MONITOR/MWAIT inside guest without being intercepted by host.

This is useful in case admin wishes to allocate a dedicated logical
processor for each vCPU thread. Thus, making it safe for guest to
completely control the power-state of the logical processor.

The ability to use this new KVM capability was introduced to QEMU by
commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|off”) and
2266d4431132 ("i386/cpu: make -cpu host support monitor/mwait”).

However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle
kernel module to execute c1e_promotion_disable() which will attempt to
RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable"
bit. This behaviour was introduced by commit
32e9518005c8 ("intel_idle: export both C1 and C1E”).

Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0
will cause the above guest behaviour to raise a #GP which will cause
guest to kernel panic.

Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to
avoid #GP in guest in this scenario.

Future commits can optimise emulation further by reflecting guest
MSR changes to host MSR to provide guest with the ability to
fine-tune the dedicated logical processor power-state.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 arch/x86/kvm/vmx/vmx.c | 6 ++++++
 arch/x86/kvm/vmx/vmx.h | 2 ++
 arch/x86/kvm/x86.c     | 1 +
 3 files changed, 9 insertions(+)

Comments

Sean Christopherson April 15, 2019, 6:17 p.m. UTC | #1
On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
> Since commits 668fffa3f838 ("kvm: better MWAIT emulation for guests”)
> and 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts”),
> KVM was modified to allow an admin to configure certain guests to execute
> MONITOR/MWAIT inside guest without being intercepted by host.
> 
> This is useful in case admin wishes to allocate a dedicated logical
> processor for each vCPU thread. Thus, making it safe for guest to
> completely control the power-state of the logical processor.
> 
> The ability to use this new KVM capability was introduced to QEMU by
> commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|off”) and
> 2266d4431132 ("i386/cpu: make -cpu host support monitor/mwait”).
> 
> However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle
                                                             ^^^^
                                                             its

English is a wonderful language...

> kernel module to execute c1e_promotion_disable() which will attempt to
> RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable"
> bit. This behaviour was introduced by commit
> 32e9518005c8 ("intel_idle: export both C1 and C1E”).

Technically, I think this is a Qemu bug.  KVM reports all zeros for
CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
MONITOR/MWAIT sub-features should be a separate enabling patch set.

Note, there is a virtualization hole regarding MWAIT as KVM can't
intercept MWAIT when executed with unsupported hints/features, but
I don't think that absolves Qemu of wrongdoing.

> Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0
> will cause the above guest behaviour to raise a #GP which will cause
> guest to kernel panic.
> 
> Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to
> avoid #GP in guest in this scenario.
> 
> Future commits can optimise emulation further by reflecting guest
> MSR changes to host MSR to provide guest with the ability to
> fine-tune the dedicated logical processor power-state.
> 
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 6 ++++++
>  arch/x86/kvm/vmx/vmx.h | 2 ++
>  arch/x86/kvm/x86.c     | 1 +
>  3 files changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2634ee8c9dc8..6246d782b746 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1696,6 +1696,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_SYSENTER_ESP:
>  		msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP);
>  		break;
> +	case MSR_IA32_POWER_CTL:
> +		msr_info->data = vmx->msr_ia32_power_ctl;
> +		break;
>  	case MSR_IA32_BNDCFGS:
>  		if (!kvm_mpx_supported() ||
>  		    (!msr_info->host_initiated &&
> @@ -1826,6 +1829,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_SYSENTER_ESP:
>  		vmcs_writel(GUEST_SYSENTER_ESP, data);
>  		break;
> +	case MSR_IA32_POWER_CTL:
> +		vmx->msr_ia32_power_ctl = data;
> +		break;
>  	case MSR_IA32_BNDCFGS:
>  		if (!kvm_mpx_supported() ||
>  		    (!msr_info->host_initiated &&

If KVM does go the route of advertising MWAIT/MONITOR sub-features, then I
think the MSR needs to be emulated on both Intel and AMD.  Glancing through
drivers/idle/intel_idle.c, I don't see anything that prevents it from
successfully probing an "Intel" vCPU that is being emulated on AMD hardware.

> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 99328954c2fc..e9772850a2a1 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -259,6 +259,8 @@ struct vcpu_vmx {
>  
>  	unsigned long host_debugctlmsr;
>  
> +	u64 msr_ia32_power_ctl;
> +
>  	/*
>  	 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
>  	 * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 02c8e095a239..39ee4087f954 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1167,6 +1167,7 @@ static u32 emulated_msrs[] = {
>  	MSR_PLATFORM_INFO,
>  	MSR_MISC_FEATURES_ENABLES,
>  	MSR_AMD64_VIRT_SPEC_CTRL,
> +	MSR_IA32_POWER_CTL,
>  };
>  
>  static unsigned num_emulated_msrs;
> -- 
> 2.20.1
>
Liran Alon April 16, 2019, 3:21 p.m. UTC | #2
> On 15 Apr 2019, at 21:17, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
>> Since commits 668fffa3f838 ("kvm: better MWAIT emulation for guests”)
>> and 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts”),
>> KVM was modified to allow an admin to configure certain guests to execute
>> MONITOR/MWAIT inside guest without being intercepted by host.
>> 
>> This is useful in case admin wishes to allocate a dedicated logical
>> processor for each vCPU thread. Thus, making it safe for guest to
>> completely control the power-state of the logical processor.
>> 
>> The ability to use this new KVM capability was introduced to QEMU by
>> commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|off”) and
>> 2266d4431132 ("i386/cpu: make -cpu host support monitor/mwait”).
>> 
>> However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle
>                                                             ^^^^
>                                                             its
> 
> English is a wonderful language…

Yeah OK… :\

> 
>> kernel module to execute c1e_promotion_disable() which will attempt to
>> RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable"
>> bit. This behaviour was introduced by commit
>> 32e9518005c8 ("intel_idle: export both C1 and C1E”).
> 
> Technically, I think this is a Qemu bug.  KVM reports all zeros for
> CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
> KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
> MONITOR/MWAIT sub-features should be a separate enabling patch set.

At some point in time Jim added commit df9cb9cc5bcd ("kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation”)
which added the following paragraph to documentation:
"Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
expose cpuid features (e.g. MONITOR) which are not supported by kvm in
its default configuration. If userspace enables such capabilities, it
is responsible for modifying the results of this ioctl appropriately.”

It’s indeed not clear what it means to “modify the results of this ioctl *appropriately*” right?
It can either mean you just expose in CPUID[EAX=1].ECX support for MONITOR/MWAIT
or that you also expose CPUID_MWAIT_LEAF (CPUID[EAX=5]).
Both regardless of the value returned from KVM_GET_SUPPORTED_CPUID ioctl.

Having said that, I tend to agree with you.
Instead of emulating this MSR in KVM, I think it it preferred to change QEMU to expose MONITOR/MWAIT support in CPUID[EAX=1].ECX
but in CPUID[EAX=5] init everything as in host besides ECX[0] which will be set to 0 to report we don’t support extensions.
(We still want to support range of monitor line size, whether we can treat interrupts as break-events for MWAIT and the supported C substates).
I will create this patch for QEMU.

But in addition, this MSR emulation I think is still required. As guests may still legitimately use this MSR to enable C1E
even when MWAIT extensions is not supported. This is because just executing HLT/MWAIT on all cores in package allows
auto transition to C1E.

So to conclude I think this KVM patch is still required and that we need a QEMU patch in addition.
Do you agree?

> 
> Note, there is a virtualization hole regarding MWAIT as KVM can't
> intercept MWAIT when executed with unsupported hints/features, but
> I don't think that absolves Qemu of wrongdoing.

I think that’s OK as this will just cause the logical processor to ignore MWAIT extension register
Specifying Sub C-state and target C-state. The reason this is OK is because in SDM on MWAIT it says:
"Implementation-specific conditions may cause a processor to ignore the hint and enter a different optimized state”.
So our implementation-specific condition is to always ignore the hint & extension :D

> 
>> Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0
>> will cause the above guest behaviour to raise a #GP which will cause
>> guest to kernel panic.
>> 
>> Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to
>> avoid #GP in guest in this scenario.
>> 
>> Future commits can optimise emulation further by reflecting guest
>> MSR changes to host MSR to provide guest with the ability to
>> fine-tune the dedicated logical processor power-state.
>> 
>> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 6 ++++++
>> arch/x86/kvm/vmx/vmx.h | 2 ++
>> arch/x86/kvm/x86.c     | 1 +
>> 3 files changed, 9 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 2634ee8c9dc8..6246d782b746 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -1696,6 +1696,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 	case MSR_IA32_SYSENTER_ESP:
>> 		msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP);
>> 		break;
>> +	case MSR_IA32_POWER_CTL:
>> +		msr_info->data = vmx->msr_ia32_power_ctl;
>> +		break;
>> 	case MSR_IA32_BNDCFGS:
>> 		if (!kvm_mpx_supported() ||
>> 		    (!msr_info->host_initiated &&
>> @@ -1826,6 +1829,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 	case MSR_IA32_SYSENTER_ESP:
>> 		vmcs_writel(GUEST_SYSENTER_ESP, data);
>> 		break;
>> +	case MSR_IA32_POWER_CTL:
>> +		vmx->msr_ia32_power_ctl = data;
>> +		break;
>> 	case MSR_IA32_BNDCFGS:
>> 		if (!kvm_mpx_supported() ||
>> 		    (!msr_info->host_initiated &&
> 
> If KVM does go the route of advertising MWAIT/MONITOR sub-features, then I
> think the MSR needs to be emulated on both Intel and AMD.  Glancing through
> drivers/idle/intel_idle.c, I don't see anything that prevents it from
> successfully probing an "Intel" vCPU that is being emulated on AMD hardware.

Do we even support exposing “Intel” vCPU which is emulated using AMD SVM?
In that case, I agree and I will just move code to kvm_set_msr_common() & kvm_get_msr_common().

Thanks for the insightful review,
-Liran

> 
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index 99328954c2fc..e9772850a2a1 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -259,6 +259,8 @@ struct vcpu_vmx {
>> 
>> 	unsigned long host_debugctlmsr;
>> 
>> +	u64 msr_ia32_power_ctl;
>> +
>> 	/*
>> 	 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
>> 	 * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 02c8e095a239..39ee4087f954 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1167,6 +1167,7 @@ static u32 emulated_msrs[] = {
>> 	MSR_PLATFORM_INFO,
>> 	MSR_MISC_FEATURES_ENABLES,
>> 	MSR_AMD64_VIRT_SPEC_CTRL,
>> +	MSR_IA32_POWER_CTL,
>> };
>> 
>> static unsigned num_emulated_msrs;
>> -- 
>> 2.20.1
>>
Paolo Bonzini April 16, 2019, 3:23 p.m. UTC | #3
On 16/04/19 17:21, Liran Alon wrote:
> 
> But in addition, this MSR emulation I think is still required. As
> guests may still legitimately use this MSR to enable C1E even when
> MWAIT extensions is not supported. This is because just executing
> HLT/MWAIT on all cores in package allows auto transition to C1E.
> 
> So to conclude I think this KVM patch is still required and that we
> need a QEMU patch in addition. Do you agree?
> 

Yes, I agree and I have queued the patch.

Paolo
Liran Alon April 16, 2019, 3:29 p.m. UTC | #4
> On 16 Apr 2019, at 18:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 16/04/19 17:21, Liran Alon wrote:
>> 
>> But in addition, this MSR emulation I think is still required. As
>> guests may still legitimately use this MSR to enable C1E even when
>> MWAIT extensions is not supported. This is because just executing
>> HLT/MWAIT on all cores in package allows auto transition to C1E.
>> 
>> So to conclude I think this KVM patch is still required and that we
>> need a QEMU patch in addition. Do you agree?
>> 
> 
> Yes, I agree and I have queued the patch.
> 
> Paolo

Thanks Paolo! :)
Have you also moved the MSR emulation to the more generic x86 kvm_set_msr_common() & kvm_get_msr_common() when queueing?
(To address Sean concern that one could emulate an Intel vCPU on top of AMD SVM…)

-Liran
Liran Alon April 16, 2019, 3:40 p.m. UTC | #5
> On 16 Apr 2019, at 18:21, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 15 Apr 2019, at 21:17, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>> 
>> On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
>> 
>> Technically, I think this is a Qemu bug.  KVM reports all zeros for
>> CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
>> KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
>> MONITOR/MWAIT sub-features should be a separate enabling patch set.
> 
> At some point in time Jim added commit df9cb9cc5bcd ("kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation”)
> which added the following paragraph to documentation:
> "Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
> expose cpuid features (e.g. MONITOR) which are not supported by kvm in
> its default configuration. If userspace enables such capabilities, it
> is responsible for modifying the results of this ioctl appropriately.”
> 
> It’s indeed not clear what it means to “modify the results of this ioctl *appropriately*” right?
> It can either mean you just expose in CPUID[EAX=1].ECX support for MONITOR/MWAIT
> or that you also expose CPUID_MWAIT_LEAF (CPUID[EAX=5]).
> Both regardless of the value returned from KVM_GET_SUPPORTED_CPUID ioctl.
> 
> Having said that, I tend to agree with you.
> Instead of emulating this MSR in KVM, I think it it preferred to change QEMU to expose MONITOR/MWAIT support in CPUID[EAX=1].ECX
> but in CPUID[EAX=5] init everything as in host besides ECX[0] which will be set to 0 to report we don’t support extensions.
> (We still want to support range of monitor line size, whether we can treat interrupts as break-events for MWAIT and the supported C substates).
> I will create this patch for QEMU.

Actually on second thought, I will just remain with the KVM patch (that Paolo was nice enough to already queue).
and not do this QEMU patch. This is because why will we want to prevent guest from specifying target C-State if he is exposed with MWAIT?
I don’t see a reason we should prevent that. Do you?

The only thing which does looks a bit weird to me is that it seems QEMU always expose extensions support even if host don’t support it.
That I don’t quite understand why. Maybe Paolo knows.

-Liran
Joao Martins April 16, 2019, 7:16 p.m. UTC | #6
On 4/16/19 4:40 PM, Liran Alon wrote:
>> On 16 Apr 2019, at 18:21, Liran Alon <liran.alon@oracle.com> wrote:
>>> On 15 Apr 2019, at 21:17, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
>>>
>>> Technically, I think this is a Qemu bug.  KVM reports all zeros for
>>> CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
>>> KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
>>> MONITOR/MWAIT sub-features should be a separate enabling patch set.
>>
>> At some point in time Jim added commit df9cb9cc5bcd ("kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation”)
>> which added the following paragraph to documentation:
>> "Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
>> expose cpuid features (e.g. MONITOR) which are not supported by kvm in
>> its default configuration. If userspace enables such capabilities, it
>> is responsible for modifying the results of this ioctl appropriately.”
>>
>> It’s indeed not clear what it means to “modify the results of this ioctl *appropriately*” right?
>> It can either mean you just expose in CPUID[EAX=1].ECX support for MONITOR/MWAIT
>> or that you also expose CPUID_MWAIT_LEAF (CPUID[EAX=5]).
>> Both regardless of the value returned from KVM_GET_SUPPORTED_CPUID ioctl.
>>
>> Having said that, I tend to agree with you.
>> Instead of emulating this MSR in KVM, I think it it preferred to change QEMU to expose MONITOR/MWAIT support in CPUID[EAX=1].ECX
>> but in CPUID[EAX=5] init everything as in host besides ECX[0] which will be set to 0 to report we don’t support extensions.
>> (We still want to support range of monitor line size, whether we can treat interrupts as break-events for MWAIT and the supported C substates).
>> I will create this patch for QEMU.
> 
> Actually on second thought, I will just remain with the KVM patch (that Paolo was nice enough to already queue).
> and not do this QEMU patch. This is because why will we want to prevent guest from specifying target C-State if he is exposed with MWAIT?
> I don’t see a reason we should prevent that. Do you?
> 
One reason it is a good idea to prevent the guest from entering deeper
C-states (e.g. deeper than C2) is to allow preemption timer to be used again
when guests are exposed with MWAIT (currently we can't do that).

Not sure if this would be towards not exposing MWAIT leaf (CPUID[EAX=5]) at all.
But at least it's one other reason I see about limiting MWAIT exposed
featureset. Userspace would zero out CPUID[EAX=5].EDX or mask the exposed host
EDX value to the first 2 C-states of host values. Even if we zero out
CPUID[EAX=5].EDX (and consequently not intel_idle does get used) guest can still
use mwait (but is limited to C1 IIUC).

See the KVM patch which handles that just below the scissors mark. But I am
waiting on a specific workload test (other than my own tests) before I formally
submit this (and by this I mean its counterpart on tests and qemu).

	Joao

------- >8 --------

From 29d69cd1e6d0f4e301dda53a1cef1b2c68d4a2d7 Mon Sep 17 00:00:00 2001
From: Joao Martins <joao.m.martins@oracle.com>
Date: Thu, Feb 21 12:48:07 2019 -0500
Subject: [PATCH] KVM: VMX: use preemption timer for C-states <= C2

According to section "25.5.1 VMX-Preemption Timer" in Volume 3C of Intel
SDM, the VMX preemption timer stops counting on C-states deeper than C2.

When mwait is exposed to the guest and MWAIT/MONITOR intercepts
are disabled, and only if the advertised MWAIT substates are
limited up to C2 (or zeroed out), we use preemption timer.

A guest which happens to ignore CPUID is going against spec,
so they can only shot themselves in the foot.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/kvm/vmx/vmx.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h |  3 +++
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3fe2020e3bc4..153957bdfa73 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -45,6 +45,7 @@
 #include <asm/mmu_context.h>
 #include <asm/mshyperv.h>
 #include <asm/spec-ctrl.h>
+#include <asm/mwait.h>
 #include <asm/virtext.h>
 #include <asm/vmx.h>

@@ -6962,6 +6963,16 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }

+static u32 vmx_get_mwait_substates(struct kvm_vcpu *vcpu)
+{
+	u32 eax, ebx, ecx, mwait_substates;
+
+	eax = CPUID_MWAIT_LEAF;
+	if (kvm_cpuid(vcpu, &eax, &ebx, &ecx, &mwait_substates, false))
+		return mwait_substates;
+	return 0;
+}
+
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6987,6 +6998,36 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 	if (boot_cpu_has(X86_FEATURE_INTEL_PT) &&
 			guest_cpuid_has(vcpu, X86_FEATURE_INTEL_PT))
 		update_intel_pt_cfg(vcpu);
+
+	if (kvm_mwait_in_guest(vcpu->kvm))
+		to_vmx(vcpu)->mwait_substates = vmx_get_mwait_substates(vcpu);
+}
+
+static bool vmx_mwait_hv_timer_reliable(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Per Intel SDM:
+	 *
+	 * Bits 03-00: Number of C0* sub C-states supported using MWAIT
+	 * Bits 07-04: Number of C1* sub C-states supported using MWAIT
+	 * Bits 11-08: Number of C2* sub C-states supported using MWAIT
+	 * Bits 15-12: Number of C3* sub C-states supported using MWAIT
+	 * Bits 19-16: Number of C4* sub C-states supported using MWAIT
+	 * Bits 23-20: Number of C5* sub C-states supported using MWAIT
+	 * Bits 27-24: Number of C6* sub C-states supported using MWAIT
+	 * Bits 31-28: Number of C7* sub C-states supported using MWAIT
+	 *
+	 * NOTE:
+	 * The definition of C0 through C7 states for MWAIT extension are
+	 * processor-specific C-states, not ACPI C-states.
+	 *
+	 * The VMX preemption timer stops counting in C-states higher than C2.
+	 * Most processors define MWAIT 0x0/0x1 as C1/C1E. MWAIT 0x10
+	 * corresponds to C3 except that on Atom processors which is C2.
+	 * So deem the first two as safe and allow preemption timer to be used.
+	 */
+#define VMX_MWAIT_CSTATES_UNSAFE 0xffffff00
+	return !(to_vmx(vcpu)->mwait_substates & VMX_MWAIT_CSTATES_UNSAFE);
 }

 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
@@ -7046,7 +7087,7 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64
guest_deadline_tsc)
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;

-	if (kvm_mwait_in_guest(vcpu->kvm))
+	if (kvm_mwait_in_guest(vcpu->kvm) && !vmx_mwait_hv_timer_reliable(vcpu))
 		return -EOPNOTSUPP;

 	vmx = to_vmx(vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1e42f983e0f1..6024ae52f90c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -257,6 +257,9 @@ struct vcpu_vmx {

 	unsigned long host_debugctlmsr;

+	/* maximum mwait sub-states */
+	u32 mwait_substates;
+
 	u64 msr_ia32_power_ctl;

 	/*
Wanpeng Li May 10, 2019, 9:54 a.m. UTC | #7
Hi all,
On Wed, 17 Apr 2019 at 03:18, Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 4/16/19 4:40 PM, Liran Alon wrote:
> >> On 16 Apr 2019, at 18:21, Liran Alon <liran.alon@oracle.com> wrote:
> >>> On 15 Apr 2019, at 21:17, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >>> On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
> >>>
> >>> Technically, I think this is a Qemu bug.  KVM reports all zeros for
> >>> CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
> >>> KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
> >>> MONITOR/MWAIT sub-features should be a separate enabling patch set.
> >>
> >> At some point in time Jim added commit df9cb9cc5bcd ("kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation”)
> >> which added the following paragraph to documentation:
> >> "Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
> >> expose cpuid features (e.g. MONITOR) which are not supported by kvm in
> >> its default configuration. If userspace enables such capabilities, it
> >> is responsible for modifying the results of this ioctl appropriately.”
> >>
> >> It’s indeed not clear what it means to “modify the results of this ioctl *appropriately*” right?
> >> It can either mean you just expose in CPUID[EAX=1].ECX support for MONITOR/MWAIT
> >> or that you also expose CPUID_MWAIT_LEAF (CPUID[EAX=5]).
> >> Both regardless of the value returned from KVM_GET_SUPPORTED_CPUID ioctl.
> >>
> >> Having said that, I tend to agree with you.
> >> Instead of emulating this MSR in KVM, I think it it preferred to change QEMU to expose MONITOR/MWAIT support in CPUID[EAX=1].ECX
> >> but in CPUID[EAX=5] init everything as in host besides ECX[0] which will be set to 0 to report we don’t support extensions.
> >> (We still want to support range of monitor line size, whether we can treat interrupts as break-events for MWAIT and the supported C substates).
> >> I will create this patch for QEMU.
> >
> > Actually on second thought, I will just remain with the KVM patch (that Paolo was nice enough to already queue).
> > and not do this QEMU patch. This is because why will we want to prevent guest from specifying target C-State if he is exposed with MWAIT?
> > I don’t see a reason we should prevent that. Do you?
> >
> One reason it is a good idea to prevent the guest from entering deeper
> C-states (e.g. deeper than C2) is to allow preemption timer to be used again
> when guests are exposed with MWAIT (currently we can't do that).

It is weird that we can observe intel_idle driver in the guest
executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
server, however, we can't observe this on SKX/CLX server, it just
enters maximal C1. All the setups between HSW and SKX/CLX are the
same. Sean and Liran, any opinions?

Regards,
Wanpeng Li
Joao Martins May 10, 2019, 10:34 a.m. UTC | #8
On 5/10/19 10:54 AM, Wanpeng Li wrote:
> Hi all,
> On Wed, 17 Apr 2019 at 03:18, Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 4/16/19 4:40 PM, Liran Alon wrote:
>>>> On 16 Apr 2019, at 18:21, Liran Alon <liran.alon@oracle.com> wrote:
>>>>> On 15 Apr 2019, at 21:17, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>>>> On Mon, Apr 15, 2019 at 06:45:26PM +0300, Liran Alon wrote:
>>>>>
>>>>> Technically, I think this is a Qemu bug.  KVM reports all zeros for
>>>>> CPUID_MWAIT_LEAF when userspace queries KVM_GET_SUPPORTED_CPUID and
>>>>> KVM_GET_EMULATED_CPUID.  And I think that's correct/desired, supporting
>>>>> MONITOR/MWAIT sub-features should be a separate enabling patch set.
>>>>
>>>> At some point in time Jim added commit df9cb9cc5bcd ("kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation”)
>>>> which added the following paragraph to documentation:
>>>> "Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
>>>> expose cpuid features (e.g. MONITOR) which are not supported by kvm in
>>>> its default configuration. If userspace enables such capabilities, it
>>>> is responsible for modifying the results of this ioctl appropriately.”
>>>>
>>>> It’s indeed not clear what it means to “modify the results of this ioctl *appropriately*” right?
>>>> It can either mean you just expose in CPUID[EAX=1].ECX support for MONITOR/MWAIT
>>>> or that you also expose CPUID_MWAIT_LEAF (CPUID[EAX=5]).
>>>> Both regardless of the value returned from KVM_GET_SUPPORTED_CPUID ioctl.
>>>>
>>>> Having said that, I tend to agree with you.
>>>> Instead of emulating this MSR in KVM, I think it it preferred to change QEMU to expose MONITOR/MWAIT support in CPUID[EAX=1].ECX
>>>> but in CPUID[EAX=5] init everything as in host besides ECX[0] which will be set to 0 to report we don’t support extensions.
>>>> (We still want to support range of monitor line size, whether we can treat interrupts as break-events for MWAIT and the supported C substates).
>>>> I will create this patch for QEMU.
>>>
>>> Actually on second thought, I will just remain with the KVM patch (that Paolo was nice enough to already queue).
>>> and not do this QEMU patch. This is because why will we want to prevent guest from specifying target C-State if he is exposed with MWAIT?
>>> I don’t see a reason we should prevent that. Do you?
>>>
>> One reason it is a good idea to prevent the guest from entering deeper
>> C-states (e.g. deeper than C2) is to allow preemption timer to be used again
>> when guests are exposed with MWAIT (currently we can't do that).
> 
> It is weird that we can observe intel_idle driver in the guest
> executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> server, however, we can't observe this on SKX/CLX server, it just
> enters maximal C1. 

I assume you refer to the case where you pass the host mwait substates to the
guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
like my patch (attached in the previous message) suggests?

Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
in instruction set reference M-U) the hints register, doesn't necessarily
guarantee the specified C-state depicted in the hints will be used. The manual
makes it sound like it is tentative, and implementation-specific condition may
either ignore it or enter a different one. It appears to be only guaranteed that
it won't enter a C-{sub,}state deeper than the one depicted.

> All the setups between HSW and SKX/CLX are the
> same. Sean and Liran, any opinions?
> 
> Regards,
> Wanpeng Li
>
Sean Christopherson May 10, 2019, 5:17 p.m. UTC | #9
On Fri, May 10, 2019 at 11:34:41AM +0100, Joao Martins wrote:
> On 5/10/19 10:54 AM, Wanpeng Li wrote:
> > It is weird that we can observe intel_idle driver in the guest
> > executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> > server, however, we can't observe this on SKX/CLX server, it just
> > enters maximal C1. 
> 
> I assume you refer to the case where you pass the host mwait substates to the
> guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
> like my patch (attached in the previous message) suggests?
> 
> Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
> intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
> in instruction set reference M-U) the hints register, doesn't necessarily
> guarantee the specified C-state depicted in the hints will be used. The manual
> makes it sound like it is tentative, and implementation-specific condition may
> either ignore it or enter a different one. It appears to be only guaranteed that
> it won't enter a C-{sub,}state deeper than the one depicted.

Yep, section "MWAIT EXTENSIONS FOR ADVANCED POWER MANAGEMENT" is more
explicit on this point:

  At CPL=0, system software can specify desired C-state and sub C-state by
  using the MWAIT hints register (EAX).  Processors will not go to C-state
  and sub C-state deeper than what is specified by the hint register.

As for why SKX/CLX only enters C1, AFAICT SKX isn't configured to support
C3, e.g. skx_cstates in drivers/idle/intel_idle.c shows C1, C1E and C6.
A quick search brings up a variety of docs that confirm this.  My guess is
that C1E provides better power/performance than C3 for the majority of
server workloads, e.g. C3 doesn't provide enough power savings to justify
its higher latency and TLB flush.
Wanpeng Li May 13, 2019, 9:13 a.m. UTC | #10
On Sat, 11 May 2019 at 01:17, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, May 10, 2019 at 11:34:41AM +0100, Joao Martins wrote:
> > On 5/10/19 10:54 AM, Wanpeng Li wrote:
> > > It is weird that we can observe intel_idle driver in the guest
> > > executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> > > server, however, we can't observe this on SKX/CLX server, it just
> > > enters maximal C1.
> >
> > I assume you refer to the case where you pass the host mwait substates to the
> > guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
> > like my patch (attached in the previous message) suggests?
> >
> > Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
> > intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
> > in instruction set reference M-U) the hints register, doesn't necessarily
> > guarantee the specified C-state depicted in the hints will be used. The manual
> > makes it sound like it is tentative, and implementation-specific condition may
> > either ignore it or enter a different one. It appears to be only guaranteed that
> > it won't enter a C-{sub,}state deeper than the one depicted.
>
> Yep, section "MWAIT EXTENSIONS FOR ADVANCED POWER MANAGEMENT" is more
> explicit on this point:
>
>   At CPL=0, system software can specify desired C-state and sub C-state by
>   using the MWAIT hints register (EAX).  Processors will not go to C-state
>   and sub C-state deeper than what is specified by the hint register.
>
> As for why SKX/CLX only enters C1, AFAICT SKX isn't configured to support
> C3, e.g. skx_cstates in drivers/idle/intel_idle.c shows C1, C1E and C6.
> A quick search brings up a variety of docs that confirm this.  My guess is
> that C1E provides better power/performance than C3 for the majority of
> server workloads, e.g. C3 doesn't provide enough power savings to justify
> its higher latency and TLB flush.

You are right, I figure this out by referring to the SKX/CLX EDS, the
Core C-States of these two generations just support CC0/CC1/CC1E/CC6.
The issue here is after exposing mwait to the guest, SKX/CLX guest
can't enter CC6, however, HSW guest can enter CC3/CC6. Both HSW and
SKX/CLX hosts can enter CC6. We observe SKX/CLX guests execute mwait
eax 0x20, however, we can't observe the corresponding pCPU enter CC6
by turbostat or reading MSR_CORE_C6_RESIDENCY directly.

Regards,
Wanpeng Li
Sean Christopherson May 15, 2019, 2:30 p.m. UTC | #11
On Mon, May 13, 2019 at 05:13:29PM +0800, Wanpeng Li wrote:
> On Sat, 11 May 2019 at 01:17, Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Fri, May 10, 2019 at 11:34:41AM +0100, Joao Martins wrote:
> > > On 5/10/19 10:54 AM, Wanpeng Li wrote:
> > > > It is weird that we can observe intel_idle driver in the guest
> > > > executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> > > > server, however, we can't observe this on SKX/CLX server, it just
> > > > enters maximal C1.
> > >
> > > I assume you refer to the case where you pass the host mwait substates to the
> > > guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
> > > like my patch (attached in the previous message) suggests?
> > >
> > > Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
> > > intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
> > > in instruction set reference M-U) the hints register, doesn't necessarily
> > > guarantee the specified C-state depicted in the hints will be used. The manual
> > > makes it sound like it is tentative, and implementation-specific condition may
> > > either ignore it or enter a different one. It appears to be only guaranteed that
> > > it won't enter a C-{sub,}state deeper than the one depicted.
> >
> > Yep, section "MWAIT EXTENSIONS FOR ADVANCED POWER MANAGEMENT" is more
> > explicit on this point:
> >
> >   At CPL=0, system software can specify desired C-state and sub C-state by
> >   using the MWAIT hints register (EAX).  Processors will not go to C-state
> >   and sub C-state deeper than what is specified by the hint register.
> >
> > As for why SKX/CLX only enters C1, AFAICT SKX isn't configured to support
> > C3, e.g. skx_cstates in drivers/idle/intel_idle.c shows C1, C1E and C6.
> > A quick search brings up a variety of docs that confirm this.  My guess is
> > that C1E provides better power/performance than C3 for the majority of
> > server workloads, e.g. C3 doesn't provide enough power savings to justify
> > its higher latency and TLB flush.
> 
> You are right, I figure this out by referring to the SKX/CLX EDS, the
> Core C-States of these two generations just support CC0/CC1/CC1E/CC6.
> The issue here is after exposing mwait to the guest, SKX/CLX guest
> can't enter CC6, however, HSW guest can enter CC3/CC6. Both HSW and
> SKX/CLX hosts can enter CC6. We observe SKX/CLX guests execute mwait
> eax 0x20, however, we can't observe the corresponding pCPU enter CC6
> by turbostat or reading MSR_CORE_C6_RESIDENCY directly.

It's likely that the CPU is operating as expected and isn't dropping into
the deeper sleep state because of some heuristic or wake event.  It might
be something as simple as the combination of periodic tick interrupts
between host and guest occuring too frequently (to get to C6), or it could
be a much more complex scenario.  It's been several years since I've done
anything close to hands on debug with C-states, I have no idea what
capabilities are available to help debug this sort of thing.  Sorry I
can't be more helpful.
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2634ee8c9dc8..6246d782b746 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1696,6 +1696,9 @@  static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_SYSENTER_ESP:
 		msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP);
 		break;
+	case MSR_IA32_POWER_CTL:
+		msr_info->data = vmx->msr_ia32_power_ctl;
+		break;
 	case MSR_IA32_BNDCFGS:
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
@@ -1826,6 +1829,9 @@  static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_SYSENTER_ESP:
 		vmcs_writel(GUEST_SYSENTER_ESP, data);
 		break;
+	case MSR_IA32_POWER_CTL:
+		vmx->msr_ia32_power_ctl = data;
+		break;
 	case MSR_IA32_BNDCFGS:
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 99328954c2fc..e9772850a2a1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -259,6 +259,8 @@  struct vcpu_vmx {
 
 	unsigned long host_debugctlmsr;
 
+	u64 msr_ia32_power_ctl;
+
 	/*
 	 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
 	 * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02c8e095a239..39ee4087f954 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1167,6 +1167,7 @@  static u32 emulated_msrs[] = {
 	MSR_PLATFORM_INFO,
 	MSR_MISC_FEATURES_ENABLES,
 	MSR_AMD64_VIRT_SPEC_CTRL,
+	MSR_IA32_POWER_CTL,
 };
 
 static unsigned num_emulated_msrs;