[v4] kvm: better MWAIT emulation for guests
diff mbox

Message ID 1489605443-21045-1-git-send-email-mst@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin March 15, 2017, 7:28 p.m. UTC
Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
unless explicitly provided with kernel command line argument
"idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
without checking CPUID.

We currently emulate that as a NOP but on VMX we can do better: let
guest stop the CPU until timer, IPI or memory change.  CPU will be busy
but that isn't any worse than a NOP emulation.

Note that mwait within guests is not the same as on real hardware
because halt causes an exit while mwait doesn't.  For this reason it
might not be a good idea to use the regular MWAIT flag in CPUID to
signal this capability.  Add a flag in the hypervisor leaf instead.

Additionally, we add a capability for QEMU - e.g. if it knows there's an
isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
to improve guest behaviour.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Note: SVM bits are untested at this point. Seems pretty
obvious though.

changes from v3:
- don't enable capability if cli+mwait blocks interrupts
- doc typo fixes (drop drop ppc doc)

changes from v2:
- add a capability to allow host userspace to detect new kernels
- more documentation to clarify the semantics of the feature flag
  and why it's useful
- svm support as suggested by Radim

changes from v1:
- typo fix resulting in rest of leaf flags being overwritten
  Reported by: Wanpeng Li <kernellwp@gmail.com>
- updated commit log with data about guests helped by this feature
- better document differences between mwait and halt for guests


 Documentation/virtual/kvm/api.txt    |  9 +++++++++
 Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/cpuid.c                 |  3 +++
 arch/x86/kvm/svm.c                   |  2 --
 arch/x86/kvm/vmx.c                   |  6 ++++--
 arch/x86/kvm/x86.c                   |  3 +++
 arch/x86/kvm/x86.h                   | 25 +++++++++++++++++++++++++
 include/uapi/linux/kvm.h             |  1 +
 9 files changed, 52 insertions(+), 4 deletions(-)

Comments

Radim Krčmář March 15, 2017, 8:13 p.m. UTC | #1
2017-03-15 21:28+0200, Michael S. Tsirkin:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> unless explicitly provided with kernel command line argument
> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> without checking CPUID.
> 
> We currently emulate that as a NOP but on VMX we can do better: let
> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> but that isn't any worse than a NOP emulation.
> 
> Note that mwait within guests is not the same as on real hardware
> because halt causes an exit while mwait doesn't.  For this reason it
> might not be a good idea to use the regular MWAIT flag in CPUID to
> signal this capability.  Add a flag in the hypervisor leaf instead.
> 
> Additionally, we add a capability for QEMU - e.g. if it knows there's an
> isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> to improve guest behaviour.
> 
> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Note: SVM bits are untested at this point. Seems pretty
> obvious though.
> 
> changes from v3:
> - don't enable capability if cli+mwait blocks interrupts
> - doc typo fixes (drop drop ppc doc)
> 
> changes from v2:
> - add a capability to allow host userspace to detect new kernels
> - more documentation to clarify the semantics of the feature flag
>   and why it's useful
> - svm support as suggested by Radim
> 
> changes from v1:
> - typo fix resulting in rest of leaf flags being overwritten
>   Reported by: Wanpeng Li <kernellwp@gmail.com>
> - updated commit log with data about guests helped by this feature
> - better document differences between mwait and halt for guests
> 
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>  	    __rem;						\
>  	 })
>  
> +static bool kvm_mwait_in_guest(void)
> +{
> +	unsigned int eax, ebx, ecx;
> +
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> +		return -ENODEV;
> +
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +		return -ENODEV;
> +
> +	/*
> +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> +	 * they would allow guest to stop the CPU completely by disabling
> +	 * interrupts then invoking MWAIT.
> +	 */
> +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> +		return -ENODEV;
> +
> +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
> +
> +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +		return -ENODEV;

The guest is still able to set ecx=0 with MWAIT, which should be the
same as not having the CPUID flag, so I'm wondering how this check
prevents anything harmful ... is it really a cpu "feature"?

If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
hang that Gabriel is hitting.

Gabriel,

 - do you see VM exits on the "hung" VCPU?
 - what is your CPU model?
 - what do you get after running this C program on host and guest?

   #include <stdint.h>
   #include <stdio.h>
   
   int main(void) {
   	uint32_t eax = 5, ebx, ecx = 0, edx;
   	asm ("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
   
   	printf("eax=%#08x ebx=%#08x ecx=%#08x edx=%#08x\n", eax, ebx, ecx, edx);
   
   	return 0;
   }

Thanks.
Gabriel L. Somlo March 15, 2017, 8:21 p.m. UTC | #2
On Wed, Mar 15, 2017 at 09:13:49PM +0100, Radim Krčmář wrote:
> 2017-03-15 21:28+0200, Michael S. Tsirkin:
> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > unless explicitly provided with kernel command line argument
> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > without checking CPUID.
> > 
> > We currently emulate that as a NOP but on VMX we can do better: let
> > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> > but that isn't any worse than a NOP emulation.
> > 
> > Note that mwait within guests is not the same as on real hardware
> > because halt causes an exit while mwait doesn't.  For this reason it
> > might not be a good idea to use the regular MWAIT flag in CPUID to
> > signal this capability.  Add a flag in the hypervisor leaf instead.
> > 
> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > to improve guest behaviour.
> > 
> > Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Note: SVM bits are untested at this point. Seems pretty
> > obvious though.
> > 
> > changes from v3:
> > - don't enable capability if cli+mwait blocks interrupts
> > - doc typo fixes (drop drop ppc doc)
> > 
> > changes from v2:
> > - add a capability to allow host userspace to detect new kernels
> > - more documentation to clarify the semantics of the feature flag
> >   and why it's useful
> > - svm support as suggested by Radim
> > 
> > changes from v1:
> > - typo fix resulting in rest of leaf flags being overwritten
> >   Reported by: Wanpeng Li <kernellwp@gmail.com>
> > - updated commit log with data about guests helped by this feature
> > - better document differences between mwait and halt for guests
> > 
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
> >  	    __rem;						\
> >  	 })
> >  
> > +static bool kvm_mwait_in_guest(void)
> > +{
> > +	unsigned int eax, ebx, ecx;
> > +
> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> > +		return -ENODEV;
> > +
> > +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> > +		return -ENODEV;
> > +
> > +	/*
> > +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> > +	 * they would allow guest to stop the CPU completely by disabling
> > +	 * interrupts then invoking MWAIT.
> > +	 */
> > +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> > +		return -ENODEV;
> > +
> > +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
> > +
> > +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> > +		return -ENODEV;
> 
> The guest is still able to set ecx=0 with MWAIT, which should be the
> same as not having the CPUID flag, so I'm wondering how this check
> prevents anything harmful ... is it really a cpu "feature"?
> 
> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
> hang that Gabriel is hitting.
> 
> Gabriel,
> 
>  - do you see VM exits on the "hung" VCPU?

how would I go about looking ?

>  - what is your CPU model?

$ cat /proc/cpuinfo
...
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
microcode       : 0xd2
cpu MHz         : 2659.966
cache size      : 4096 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
bugs            :
bogomips        : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

(this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
testing, to avoid having to reboot my primary desktop :)

>  - what do you get after running this C program on host and guest?
> 
>    #include <stdint.h>
>    #include <stdio.h>
>    
>    int main(void) {
>    	uint32_t eax = 5, ebx, ecx = 0, edx;
>    	asm ("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
>    
>    	printf("eax=%#08x ebx=%#08x ecx=%#08x edx=%#08x\n", eax, ebx, ecx, edx);
>    
>    	return 0;
>    }

eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020

HTH,
--G
Gabriel L. Somlo March 15, 2017, 8:25 p.m. UTC | #3
On Wed, Mar 15, 2017 at 04:21:41PM -0400, Gabriel L. Somlo wrote:
> > 
> >  - do you see VM exits on the "hung" VCPU?
> 
> how would I go about looking ?
> 
> >  - what is your CPU model?
> 
> $ cat /proc/cpuinfo
> ...
> processor       : 3
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> stepping        : 6
> microcode       : 0xd2
> cpu MHz         : 2659.966
> cache size      : 4096 KB
> physical id     : 3
> siblings        : 2
> core id         : 0
> cpu cores       : 2
> apicid          : 6
> initial apicid  : 6
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
> bugs            :
> bogomips        : 5320.04
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
> 
> (this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
> testing, to avoid having to reboot my primary desktop :)

Oh, in case you meant what I use on the qemu command line:

qemu-system-x86_64 -enable-kvm -m 2048 -cpu core2duo \
  -machine q35 -smp 4,cores=2 \
  -device isa-applesmc,osk="...osk.value.from.smc.goes.here..." \
  -usb -device usb-kbd -device usb-mouse \
  -smbios type=2 -kernel ./chameleon_boot \
  -device ide-drive,bus=ide.2,drive=MacHDD \
  -drive id=MacHDD,if=none,snapshot=on,file=./mac_10.7.img \
  -monitor stdio

so, core2duo -- it's the only thing OS X 10.7 will find acceptable.

> 
> >  - what do you get after running this C program on host and guest?
> > 
> >    #include <stdint.h>
> >    #include <stdio.h>
> >    
> >    int main(void) {
> >    	uint32_t eax = 5, ebx, ecx = 0, edx;
> >    	asm ("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
> >    
> >    	printf("eax=%#08x ebx=%#08x ecx=%#08x edx=%#08x\n", eax, ebx, ecx, edx);
> >    
> >    	return 0;
> >    }
> 
> eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020
> 
> HTH,
> --G
Radim Krčmář March 15, 2017, 8:46 p.m. UTC | #4
2017-03-15 16:21-0400, Gabriel L. Somlo:
> On Wed, Mar 15, 2017 at 09:13:49PM +0100, Radim Krčmář wrote:
>> 2017-03-15 21:28+0200, Michael S. Tsirkin:
>> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> > unless explicitly provided with kernel command line argument
>> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> > without checking CPUID.
>> > 
>> > We currently emulate that as a NOP but on VMX we can do better: let
>> > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>> > but that isn't any worse than a NOP emulation.
>> > 
>> > Note that mwait within guests is not the same as on real hardware
>> > because halt causes an exit while mwait doesn't.  For this reason it
>> > might not be a good idea to use the regular MWAIT flag in CPUID to
>> > signal this capability.  Add a flag in the hypervisor leaf instead.
>> > 
>> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
>> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
>> > to improve guest behaviour.
>> > 
>> > Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
>> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> > ---
>> > 
>> > Note: SVM bits are untested at this point. Seems pretty
>> > obvious though.
>> > 
>> > changes from v3:
>> > - don't enable capability if cli+mwait blocks interrupts
>> > - doc typo fixes (drop drop ppc doc)
>> > 
>> > changes from v2:
>> > - add a capability to allow host userspace to detect new kernels
>> > - more documentation to clarify the semantics of the feature flag
>> >   and why it's useful
>> > - svm support as suggested by Radim
>> > 
>> > changes from v1:
>> > - typo fix resulting in rest of leaf flags being overwritten
>> >   Reported by: Wanpeng Li <kernellwp@gmail.com>
>> > - updated commit log with data about guests helped by this feature
>> > - better document differences between mwait and halt for guests
>> > 
>> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> > @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>> >  	    __rem;						\
>> >  	 })
>> >  
>> > +static bool kvm_mwait_in_guest(void)
>> > +{
>> > +	unsigned int eax, ebx, ecx;
>> > +
>> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
>> > +		return -ENODEV;
>> > +
>> > +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>> > +		return -ENODEV;
>> > +
>> > +	/*
>> > +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
>> > +	 * they would allow guest to stop the CPU completely by disabling
>> > +	 * interrupts then invoking MWAIT.
>> > +	 */
>> > +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
>> > +		return -ENODEV;
>> > +
>> > +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
>> > +
>> > +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
>> > +		return -ENODEV;
>> 
>> The guest is still able to set ecx=0 with MWAIT, which should be the
>> same as not having the CPUID flag, so I'm wondering how this check
>> prevents anything harmful ... is it really a cpu "feature"?
>> 
>> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
>> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
>> hang that Gabriel is hitting.
>> 
>> Gabriel,
>> 
>>  - do you see VM exits on the "hung" VCPU?
> 
> how would I go about looking ?

Probably the simplest would be to try install trace-cmd and do

  trace-cmd record -e kvm:kvm_entry

for a while, then killing it with ^C and calling

  trace-cmd report

it will have lines like:

  CPU-3729  [021]  5046.222480: kvm_entry:            vcpu 7

the first column is task id, the last one is VCPU id, so you'd verify that all
VCPUs eventually enter.
You could either look at the taks id of VCPUs that are running 100% of the time
or just pipe through `grep -o 'vcpu.*' | sort -u` and hope that all of them are
running, so you'd see nice list of them all. :)

If you don't see some VCPUs there, they might have been halted, which would
show on

  virsh qemu-monitor-command rhel7 --hmp info cpus

If there are running, but never entering VCPUs, then running

  trace-cmd record -e kvm:\*

to capture all kvm events could tell us why it is stuck.

>>  - what is your CPU model?
> 
> $ cat /proc/cpuinfo
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> 
> (this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
> testing, to avoid having to reboot my primary desktop :)

Oh, ancient when it comes to VMX. :)

>>  - what do you get after running this C program on host and guest?
> 
> eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020

Your CPU has the CPUID5_ECX_INTERRUPT_BREAK, thanks.
The guest should see QEMU's default, which is

  eax=0x000000 ebx=0x000000 ecx=0x000003 edx=0x000000

Thanks.
Gabriel L. Somlo March 15, 2017, 11:22 p.m. UTC | #5
On Wed, Mar 15, 2017 at 09:46:18PM +0100, Radim Krčmář wrote:
> 2017-03-15 16:21-0400, Gabriel L. Somlo:
> > On Wed, Mar 15, 2017 at 09:13:49PM +0100, Radim Krčmář wrote:
> >> 2017-03-15 21:28+0200, Michael S. Tsirkin:
> >> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> >> > unless explicitly provided with kernel command line argument
> >> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> >> > without checking CPUID.
> >> > 
> >> > We currently emulate that as a NOP but on VMX we can do better: let
> >> > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> >> > but that isn't any worse than a NOP emulation.
> >> > 
> >> > Note that mwait within guests is not the same as on real hardware
> >> > because halt causes an exit while mwait doesn't.  For this reason it
> >> > might not be a good idea to use the regular MWAIT flag in CPUID to
> >> > signal this capability.  Add a flag in the hypervisor leaf instead.
> >> > 
> >> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> >> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> >> > to improve guest behaviour.
> >> > 
> >> > Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> >> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> > ---
> >> > 
> >> > Note: SVM bits are untested at this point. Seems pretty
> >> > obvious though.
> >> > 
> >> > changes from v3:
> >> > - don't enable capability if cli+mwait blocks interrupts
> >> > - doc typo fixes (drop drop ppc doc)
> >> > 
> >> > changes from v2:
> >> > - add a capability to allow host userspace to detect new kernels
> >> > - more documentation to clarify the semantics of the feature flag
> >> >   and why it's useful
> >> > - svm support as suggested by Radim
> >> > 
> >> > changes from v1:
> >> > - typo fix resulting in rest of leaf flags being overwritten
> >> >   Reported by: Wanpeng Li <kernellwp@gmail.com>
> >> > - updated commit log with data about guests helped by this feature
> >> > - better document differences between mwait and halt for guests
> >> > 
> >> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> >> > @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
> >> >  	    __rem;						\
> >> >  	 })
> >> >  
> >> > +static bool kvm_mwait_in_guest(void)
> >> > +{
> >> > +	unsigned int eax, ebx, ecx;
> >> > +
> >> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> >> > +		return -ENODEV;
> >> > +
> >> > +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> >> > +		return -ENODEV;
> >> > +
> >> > +	/*
> >> > +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> >> > +	 * they would allow guest to stop the CPU completely by disabling
> >> > +	 * interrupts then invoking MWAIT.
> >> > +	 */
> >> > +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> >> > +		return -ENODEV;
> >> > +
> >> > +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
> >> > +
> >> > +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> >> > +		return -ENODEV;
> >> 
> >> The guest is still able to set ecx=0 with MWAIT, which should be the
> >> same as not having the CPUID flag, so I'm wondering how this check
> >> prevents anything harmful ... is it really a cpu "feature"?
> >> 
> >> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
> >> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
> >> hang that Gabriel is hitting.
> >> 
> >> Gabriel,
> >> 
> >>  - do you see VM exits on the "hung" VCPU?
> > 
> > how would I go about looking ?
> 
> Probably the simplest would be to try install trace-cmd and do
> 
>   trace-cmd record -e kvm:kvm_entry
> 
> for a while, then killing it with ^C and calling
> 
>   trace-cmd report
> 
> it will have lines like:
> 
>   CPU-3729  [021]  5046.222480: kvm_entry:            vcpu 7
> 
> the first column is task id, the last one is VCPU id, so you'd verify that all
> VCPUs eventually enter.
> You could either look at the taks id of VCPUs that are running 100% of the time
> or just pipe through `grep -o 'vcpu.*' | sort -u` and hope that all of them are
> running, so you'd see nice list of them all. :)

so, with Michael's patch v5 (same behavior as earlier btw), I get:

# trace-cmd report | head -10000 | grep -o 'vcpu.*' | sort -u
vcpu 0
vcpu 1
vcpu 2
vcpu 3

Which means they're all there.

Also, 'top' shows 400% CPU for qemu-system-x86_64, which means I have
4 threads idle-looping at 100% each, which is consistent with a guest
using an MWAIT-based idle loop.

> If you don't see some VCPUs there, they might have been halted, which would
> show on
> 
>   virsh qemu-monitor-command rhel7 --hmp info cpus
> 
> If there are running, but never entering VCPUs, then running
> 
>   trace-cmd record -e kvm:\*
> 
> to capture all kvm events could tell us why it is stuck.
> 
> >>  - what is your CPU model?
> > 
> > $ cat /proc/cpuinfo
> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> > 
> > (this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
> > testing, to avoid having to reboot my primary desktop :)
> 
> Oh, ancient when it comes to VMX. :)

I also have a somewhat more recent (only 4 year old) Macbook Air, I'll
try to redo the whole set of tests on that one as well.

> 
> >>  - what do you get after running this C program on host and guest?
> > 
> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020
> 
> Your CPU has the CPUID5_ECX_INTERRUPT_BREAK, thanks.
> The guest should see QEMU's default, which is
> 
>   eax=0x000000 ebx=0x000000 ecx=0x000003 edx=0x000000

That's right, I finally managed to transfer a linux guest image over
to the box, and indeed that's what it produces.

More tomorrow once I have a chance to test on the Macbook Air (need to
compile kvm git master on that one first, then transfer the os x qcow
image, etc...

Thanks,
--Gabriel
kernel test robot March 16, 2017, 7:26 a.m. UTC | #6
Hi Michael,

[auto build test ERROR on kvm/linux-next]
[also build test ERROR on v4.11-rc2 next-20170310]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/kvm-better-MWAIT-emulation-for-guests/20170316-143518
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: x86_64-randconfig-x010-201711 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   In file included from arch/x86/kvm/x86.c:28:0:
   arch/x86/kvm/x86.h: In function 'kvm_mwait_in_guest':
>> arch/x86/kvm/x86.h:231:34: error: 'CPUID_MWAIT_LEAF' undeclared (first use in this function)
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^~~~~~~~~~~~~~~~
   arch/x86/kvm/x86.h:231:34: note: each undeclared identifier is reported only once for each function it appears in
>> arch/x86/kvm/x86.h:234:45: error: 'mwait_substates' undeclared (first use in this function)
     cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
                                                ^~~~~~~~~~~~~~~
>> arch/x86/kvm/x86.h:236:14: error: 'CPUID5_ECX_INTERRUPT_BREAK' undeclared (first use in this function)
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> arch/x86/kvm/x86.h:238:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^
--
   In file included from arch/x86/kvm/mmu.c:23:0:
   arch/x86/kvm/x86.h: In function 'kvm_mwait_in_guest':
>> arch/x86/kvm/x86.h:231:34: error: 'CPUID_MWAIT_LEAF' undeclared (first use in this function)
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^~~~~~~~~~~~~~~~
   arch/x86/kvm/x86.h:231:34: note: each undeclared identifier is reported only once for each function it appears in
>> arch/x86/kvm/x86.h:234:45: error: 'mwait_substates' undeclared (first use in this function)
     cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
                                                ^~~~~~~~~~~~~~~
>> arch/x86/kvm/x86.h:236:14: error: 'CPUID5_ECX_INTERRUPT_BREAK' undeclared (first use in this function)
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
   At top level:
   arch/x86/kvm/x86.h:216:13: warning: 'kvm_mwait_in_guest' defined but not used [-Wunused-function]
    static bool kvm_mwait_in_guest(void)
                ^~~~~~~~~~~~~~~~~~

vim +/CPUID_MWAIT_LEAF +231 arch/x86/kvm/x86.h

   225	
   226		/*
   227		 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
   228		 * they would allow guest to stop the CPU completely by disabling
   229		 * interrupts then invoking MWAIT.
   230		 */
 > 231		if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
   232			return -ENODEV;
   233	
 > 234		cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
   235	
 > 236		if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
   237			return -ENODEV;
 > 238	}
   239	
   240	#endif

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
kernel test robot March 16, 2017, 7:34 a.m. UTC | #7
Hi Michael,

[auto build test WARNING on kvm/linux-next]
[also build test WARNING on v4.11-rc2 next-20170310]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/kvm-better-MWAIT-emulation-for-guests/20170316-143518
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: i386-randconfig-x018-201711 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/kvm_host.h:9,
                    from arch/x86/kvm/x86.c:22:
   arch/x86/kvm/x86.h: In function 'kvm_mwait_in_guest':
   arch/x86/kvm/x86.h:231:34: error: 'CPUID_MWAIT_LEAF' undeclared (first use in this function)
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> arch/x86/kvm/x86.h:231:2: note: in expansion of macro 'if'
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
     ^~
   arch/x86/kvm/x86.h:231:34: note: each undeclared identifier is reported only once for each function it appears in
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> arch/x86/kvm/x86.h:231:2: note: in expansion of macro 'if'
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
     ^~
   In file included from arch/x86/kvm/x86.c:28:0:
   arch/x86/kvm/x86.h:234:45: error: 'mwait_substates' undeclared (first use in this function)
     cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
                                                ^~~~~~~~~~~~~~~
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/kvm_host.h:9,
                    from arch/x86/kvm/x86.c:22:
   arch/x86/kvm/x86.h:236:14: error: 'CPUID5_ECX_INTERRUPT_BREAK' undeclared (first use in this function)
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
                 ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
   arch/x86/kvm/x86.h:236:2: note: in expansion of macro 'if'
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
     ^~
   In file included from arch/x86/kvm/x86.c:28:0:
   arch/x86/kvm/x86.h:238:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^
--
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/mm_types_task.h:10,
                    from include/linux/mm_types.h:4,
                    from arch/x86/kvm/irq.h:25,
                    from arch/x86/kvm/mmu.c:21:
   arch/x86/kvm/x86.h: In function 'kvm_mwait_in_guest':
   arch/x86/kvm/x86.h:231:34: error: 'CPUID_MWAIT_LEAF' undeclared (first use in this function)
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> arch/x86/kvm/x86.h:231:2: note: in expansion of macro 'if'
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
     ^~
   arch/x86/kvm/x86.h:231:34: note: each undeclared identifier is reported only once for each function it appears in
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
                                     ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> arch/x86/kvm/x86.h:231:2: note: in expansion of macro 'if'
     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
     ^~
   In file included from arch/x86/kvm/mmu.c:23:0:
   arch/x86/kvm/x86.h:234:45: error: 'mwait_substates' undeclared (first use in this function)
     cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
                                                ^~~~~~~~~~~~~~~
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/linux/mm_types_task.h:10,
                    from include/linux/mm_types.h:4,
                    from arch/x86/kvm/irq.h:25,
                    from arch/x86/kvm/mmu.c:21:
   arch/x86/kvm/x86.h:236:14: error: 'CPUID5_ECX_INTERRUPT_BREAK' undeclared (first use in this function)
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
                 ^
   include/linux/compiler.h:160:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
   arch/x86/kvm/x86.h:236:2: note: in expansion of macro 'if'
     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
     ^~
   In file included from arch/x86/kvm/mmu.c:23:0:
   At top level:
   arch/x86/kvm/x86.h:216:13: warning: 'kvm_mwait_in_guest' defined but not used [-Wunused-function]
    static bool kvm_mwait_in_guest(void)
                ^~~~~~~~~~~~~~~~~~

vim +/if +231 arch/x86/kvm/x86.h

   215	
   216	static bool kvm_mwait_in_guest(void)
   217	{
   218		unsigned int eax, ebx, ecx;
   219	
   220		if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
   221			return -ENODEV;
   222	
   223		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
   224			return -ENODEV;
   225	
   226		/*
   227		 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
   228		 * they would allow guest to stop the CPU completely by disabling
   229		 * interrupts then invoking MWAIT.
   230		 */
 > 231		if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
   232			return -ENODEV;
   233	
   234		cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
   235	
   236		if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
   237			return -ENODEV;
   238	}
   239	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Wanpeng Li March 16, 2017, 9:30 a.m. UTC | #8
2017-03-16 4:13 GMT+08:00 Radim Krčmář <rkrcmar@redhat.com>:
> 2017-03-15 21:28+0200, Michael S. Tsirkin:
>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> unless explicitly provided with kernel command line argument
>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> without checking CPUID.
>>
>> We currently emulate that as a NOP but on VMX we can do better: let
>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>> but that isn't any worse than a NOP emulation.
>>
>> Note that mwait within guests is not the same as on real hardware
>> because halt causes an exit while mwait doesn't.  For this reason it
>> might not be a good idea to use the regular MWAIT flag in CPUID to
>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>
>> Additionally, we add a capability for QEMU - e.g. if it knows there's an
>> isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
>> to improve guest behaviour.
>>
>> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>
>> Note: SVM bits are untested at this point. Seems pretty
>> obvious though.
>>
>> changes from v3:
>> - don't enable capability if cli+mwait blocks interrupts
>> - doc typo fixes (drop drop ppc doc)
>>
>> changes from v2:
>> - add a capability to allow host userspace to detect new kernels
>> - more documentation to clarify the semantics of the feature flag
>>   and why it's useful
>> - svm support as suggested by Radim
>>
>> changes from v1:
>> - typo fix resulting in rest of leaf flags being overwritten
>>   Reported by: Wanpeng Li <kernellwp@gmail.com>
>> - updated commit log with data about guests helped by this feature
>> - better document differences between mwait and halt for guests
>>
>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>>           __rem;                                              \
>>        })
>>
>> +static bool kvm_mwait_in_guest(void)
>> +{
>> +     unsigned int eax, ebx, ecx;
>> +
>> +     if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
>> +             return -ENODEV;
>> +
>> +     if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
>> +             return -ENODEV;
>> +
>> +     /*
>> +      * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
>> +      * they would allow guest to stop the CPU completely by disabling
>> +      * interrupts then invoking MWAIT.
>> +      */
>> +     if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
>> +             return -ENODEV;
>> +
>> +     cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
>> +
>> +     if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
>> +             return -ENODEV;
>
> The guest is still able to set ecx=0 with MWAIT, which should be the

How can guest rewrite this?

Regards,
Wanpeng Li

> same as not having the CPUID flag, so I'm wondering how this check
> prevents anything harmful ... is it really a cpu "feature"?
>
> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
> hang that Gabriel is hitting.
>
> Gabriel,
>
>  - do you see VM exits on the "hung" VCPU?
>  - what is your CPU model?
>  - what do you get after running this C program on host and guest?
>
>    #include <stdint.h>
>    #include <stdio.h>
>
>    int main(void) {
>         uint32_t eax = 5, ebx, ecx = 0, edx;
>         asm ("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
>
>         printf("eax=%#08x ebx=%#08x ecx=%#08x edx=%#08x\n", eax, ebx, ecx, edx);
>
>         return 0;
>    }
>
> Thanks.

Patch
diff mbox

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c248f7..6ee2e43 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4147,3 +4147,12 @@  This capability, if KVM_CHECK_EXTENSION indicates that it is
 available, means that that the kernel can support guests using the
 hashed page table MMU defined in Power ISA V3.00 (as implemented in
 the POWER9 processor), including in-memory segment tables.
+
+8.5 KVM_CAP_X86_GUEST_MWAIT
+
+Architectures: x86
+
+This capability indicates that guest using memory monotoring instructions
+(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
+spent while virtual CPU is halted in this way will then be accounted for as
+guest running time on the host (as opposed to e.g. HLT).
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 3c65feb..04c201c 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -54,6 +54,12 @@  KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
                                    ||       || before enabling paravirtualized
                                    ||       || spinlock support.
 ------------------------------------------------------------------------------
+KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
+                                   ||       || to halt the VCPU without exits,
+                                   ||       || time spent while halted in this
+                                   ||       || way is accounted for on host as
+                                   ||       || VCPU run time.
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..9cc77a7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -24,6 +24,7 @@ 
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_EOI		6
 #define KVM_FEATURE_PV_UNHALT		7
+#define KVM_FEATURE_MWAIT		8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc..5638102 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -594,6 +594,9 @@  static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
+		if (kvm_mwait_in_guest())
+			entry->eax |= (1 << KVM_FEATURE_MWAIT);
+
 		entry->ebx = 0;
 		entry->ecx = 0;
 		entry->edx = 0;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d1efe2c..18e53bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1198,8 +1198,6 @@  static void init_vmcb(struct vcpu_svm *svm)
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
 
 	control->iopm_base_pa = iopm_base;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 98e82ee..ea0c96a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3547,11 +3547,13 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
 
+	if (!kvm_mwait_in_guest())
+		min |= CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1faf620..8c74fff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2684,6 +2684,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ADJUST_CLOCK:
 		r = KVM_CLOCK_TSC_STABLE;
 		break;
+	case KVM_CAP_X86_GUEST_MWAIT:
+		r = kvm_mwait_in_guest();
+		break;
 	case KVM_CAP_X86_SMM:
 		/* SMBASE is usually relocated above 1M on modern chipsets,
 		 * and SMM handlers might indeed rely on 4G segment limits,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e8ff3e4..e2f6974 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -1,6 +1,7 @@ 
 #ifndef ARCH_X86_KVM_X86_H
 #define ARCH_X86_KVM_X86_H
 
+#include <asm/processor.h>
 #include <linux/kvm_host.h>
 #include <asm/pvclock.h>
 #include "kvm_cache_regs.h"
@@ -212,4 +213,28 @@  static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 	    __rem;						\
 	 })
 
+static bool kvm_mwait_in_guest(void)
+{
+	unsigned int eax, ebx, ecx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+		return -ENODEV;
+
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return -ENODEV;
+
+	/*
+	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
+	 * they would allow guest to stop the CPU completely by disabling
+	 * interrupts then invoking MWAIT.
+	 */
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return -ENODEV;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
+
+	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return -ENODEV;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d508..8b6bc06 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_X86_GUEST_MWAIT 137
 
 #ifdef KVM_CAP_IRQ_ROUTING