[v5,untested] kvm: better MWAIT emulation for guests
diff mbox

Message ID 1489612895-12799-1-git-send-email-mst@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin March 15, 2017, 9:22 p.m. UTC
Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
unless explicitly provided with kernel command line argument
"idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
without checking CPUID.

We currently emulate that as a NOP but on VMX we can do better: let
guest stop the CPU until timer, IPI or memory change.  CPU will be busy
but that isn't any worse than a NOP emulation.

Note that mwait within guests is not the same as on real hardware
because halt causes an exit while mwait doesn't.  For this reason it
might not be a good idea to use the regular MWAIT flag in CPUID to
signal this capability.  Add a flag in the hypervisor leaf instead.

Additionally, we add a capability for QEMU - e.g. if it knows there's an
isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
to improve guest behaviour.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

This is for Gabriel's testing only. A bit rushed so untested.

 Documentation/virtual/kvm/api.txt    |  9 +++++++++
 Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/cpuid.c                 |  3 +++
 arch/x86/kvm/svm.c                   |  2 --
 arch/x86/kvm/vmx.c                   |  6 ++++--
 arch/x86/kvm/x86.c                   |  3 +++
 arch/x86/kvm/x86.h                   | 28 ++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h             |  1 +
 9 files changed, 55 insertions(+), 4 deletions(-)

Comments

Gabriel L. Somlo March 15, 2017, 11:35 p.m. UTC | #1
On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> unless explicitly provided with kernel command line argument
> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> without checking CPUID.
> 
> We currently emulate that as a NOP but on VMX we can do better: let
> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> but that isn't any worse than a NOP emulation.
> 
> Note that mwait within guests is not the same as on real hardware
> because halt causes an exit while mwait doesn't.  For this reason it
> might not be a good idea to use the regular MWAIT flag in CPUID to
> signal this capability.  Add a flag in the hypervisor leaf instead.
> 
> Additionally, we add a capability for QEMU - e.g. if it knows there's an
> isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> to improve guest behaviour.

Same behavior (on the mac pro 1,1 running F22 with custom-compiled
kernel from kvm git master, plus this patch on top).

The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
on boot, does not bring up guest graphical interface within the first
10 minutes that I waited for it. That, in contrast with the default
nop-based emulation where the guest comes up within 30 seconds.

I will run another round of tests on a newer Mac (4-year-old macbook
air) and report back tomorrow.

Going off on a tangent, why would encouraging otherwise well-behaved
guests (like linux ones, for example) to use MWAIT be desirable to
begin with ? Is it a matter of minimizing the overhead associated with
exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
running guest-mode MWAIT in a tight loop will actually waste the host
CPU without the opportunity to yield to some other L0 thread. Sorry if
I fell into the middle of an ongoing conversation on this and missed
most of the relevant context, in which case please feel free to ignore
me... :)

Thanks,
--G

> 
> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> This is for Gabriel's testing only. A bit rushed so untested.
> 
>  Documentation/virtual/kvm/api.txt    |  9 +++++++++
>  Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/cpuid.c                 |  3 +++
>  arch/x86/kvm/svm.c                   |  2 --
>  arch/x86/kvm/vmx.c                   |  6 ++++--
>  arch/x86/kvm/x86.c                   |  3 +++
>  arch/x86/kvm/x86.h                   | 28 ++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h             |  1 +
>  9 files changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 3c248f7..6ee2e43 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4147,3 +4147,12 @@ This capability, if KVM_CHECK_EXTENSION indicates that it is
>  available, means that that the kernel can support guests using the
>  hashed page table MMU defined in Power ISA V3.00 (as implemented in
>  the POWER9 processor), including in-memory segment tables.
> +
> +8.5 KVM_CAP_X86_GUEST_MWAIT
> +
> +Architectures: x86
> +
> +This capability indicates that guest using memory monotoring instructions
> +(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
> +spent while virtual CPU is halted in this way will then be accounted for as
> +guest running time on the host (as opposed to e.g. HLT).
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 3c65feb..04c201c 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
>                                     ||       || before enabling paravirtualized
>                                     ||       || spinlock support.
>  ------------------------------------------------------------------------------
> +KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
> +                                   ||       || to halt the VCPU without exits,
> +                                   ||       || time spent while halted in this
> +                                   ||       || way is accounted for on host as
> +                                   ||       || VCPU run time.
> +------------------------------------------------------------------------------
>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
>                                     ||       || per-cpu warps are expected in
>                                     ||       || kvmclock.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index cff0bb6..9cc77a7 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -24,6 +24,7 @@
>  #define KVM_FEATURE_STEAL_TIME		5
>  #define KVM_FEATURE_PV_EOI		6
>  #define KVM_FEATURE_PV_UNHALT		7
> +#define KVM_FEATURE_MWAIT		8
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index efde6cc..5638102 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -594,6 +594,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		if (sched_info_on())
>  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
>  
> +		if (kvm_mwait_in_guest())
> +			entry->eax |= (1 << KVM_FEATURE_MWAIT);
> +
>  		entry->ebx = 0;
>  		entry->ecx = 0;
>  		entry->edx = 0;
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d1efe2c..18e53bc 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
>  	set_intercept(svm, INTERCEPT_CLGI);
>  	set_intercept(svm, INTERCEPT_SKINIT);
>  	set_intercept(svm, INTERCEPT_WBINVD);
> -	set_intercept(svm, INTERCEPT_MONITOR);
> -	set_intercept(svm, INTERCEPT_MWAIT);
>  	set_intercept(svm, INTERCEPT_XSETBV);
>  
>  	control->iopm_base_pa = iopm_base;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 98e82ee..ea0c96a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3547,11 +3547,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  	      CPU_BASED_USE_IO_BITMAPS |
>  	      CPU_BASED_MOV_DR_EXITING |
>  	      CPU_BASED_USE_TSC_OFFSETING |
> -	      CPU_BASED_MWAIT_EXITING |
> -	      CPU_BASED_MONITOR_EXITING |
>  	      CPU_BASED_INVLPG_EXITING |
>  	      CPU_BASED_RDPMC_EXITING;
>  
> +	if (!kvm_mwait_in_guest())
> +		min |= CPU_BASED_MWAIT_EXITING |
> +			CPU_BASED_MONITOR_EXITING;
> +
>  	opt = CPU_BASED_TPR_SHADOW |
>  	      CPU_BASED_USE_MSR_BITMAPS |
>  	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1faf620..8c74fff 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2684,6 +2684,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ADJUST_CLOCK:
>  		r = KVM_CLOCK_TSC_STABLE;
>  		break;
> +	case KVM_CAP_X86_GUEST_MWAIT:
> +		r = kvm_mwait_in_guest();
> +		break;
>  	case KVM_CAP_X86_SMM:
>  		/* SMBASE is usually relocated above 1M on modern chipsets,
>  		 * and SMM handlers might indeed rely on 4G segment limits,
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index e8ff3e4..a2d8964 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -1,6 +1,8 @@
>  #ifndef ARCH_X86_KVM_X86_H
>  #define ARCH_X86_KVM_X86_H
>  
> +#include <asm/processor.h>
> +#include <asm/mwait.h>
>  #include <linux/kvm_host.h>
>  #include <asm/pvclock.h>
>  #include "kvm_cache_regs.h"
> @@ -212,4 +214,30 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>  	    __rem;						\
>  	 })
>  
> +static inline bool kvm_mwait_in_guest(void)
> +{
> +	unsigned int eax, ebx, ecx, edx;
> +
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> +		return false;
> +
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +		return false;
> +
> +	/*
> +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> +	 * they would allow guest to stop the CPU completely by disabling
> +	 * interrupts then invoking MWAIT.
> +	 */
> +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> +		return false;
> +
> +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
> +
> +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +		return false;
> +
> +	return true;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f51d508..8b6bc06 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_MMU_RADIX 134
>  #define KVM_CAP_PPC_MMU_HASH_V3 135
>  #define KVM_CAP_IMMEDIATE_EXIT 136
> +#define KVM_CAP_X86_GUEST_MWAIT 137
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> -- 
> MST
Michael S. Tsirkin March 15, 2017, 11:41 p.m. UTC | #2
On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > unless explicitly provided with kernel command line argument
> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > without checking CPUID.
> > 
> > We currently emulate that as a NOP but on VMX we can do better: let
> > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> > but that isn't any worse than a NOP emulation.
> > 
> > Note that mwait within guests is not the same as on real hardware
> > because halt causes an exit while mwait doesn't.  For this reason it
> > might not be a good idea to use the regular MWAIT flag in CPUID to
> > signal this capability.  Add a flag in the hypervisor leaf instead.
> > 
> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > to improve guest behaviour.
> 
> Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> kernel from kvm git master, plus this patch on top).
> 
> The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> on boot, does not bring up guest graphical interface within the first
> 10 minutes that I waited for it. That, in contrast with the default
> nop-based emulation where the guest comes up within 30 seconds.


Thanks a lot, meanwhile I'll try to write a unit-test and experiment
with various behaviours.

> I will run another round of tests on a newer Mac (4-year-old macbook
> air) and report back tomorrow.
> 
> Going off on a tangent, why would encouraging otherwise well-behaved
> guests (like linux ones, for example) to use MWAIT be desirable to
> begin with ? Is it a matter of minimizing the overhead associated with
> exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> running guest-mode MWAIT in a tight loop will actually waste the host
> CPU without the opportunity to yield to some other L0 thread. Sorry if
> I fell into the middle of an ongoing conversation on this and missed
> most of the relevant context, in which case please feel free to ignore
> me... :)
> 
> Thanks,
> --G

It's just some experiments I'm running, I'm not ready to describe it
yet. I thought this part might be useful to at least some guests, so
trying to upstream it right now.
Joerg Roedel March 21, 2017, 4:16 p.m. UTC | #3
On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d1efe2c..18e53bc 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
>  	set_intercept(svm, INTERCEPT_CLGI);
>  	set_intercept(svm, INTERCEPT_SKINIT);
>  	set_intercept(svm, INTERCEPT_WBINVD);
> -	set_intercept(svm, INTERCEPT_MONITOR);
> -	set_intercept(svm, INTERCEPT_MWAIT);

Why do you remove the intercepts for AMD? The new kvm_mwait_in_guest()
function will always return false on AMD anyway, and on Intel you re-add
the intercepts for !kvm_mwait_in_guest().


	Joerg
Michael S. Tsirkin March 21, 2017, 6:45 p.m. UTC | #4
On Tue, Mar 21, 2017 at 05:16:32PM +0100, Joerg Roedel wrote:
> On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index d1efe2c..18e53bc 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
> >  	set_intercept(svm, INTERCEPT_CLGI);
> >  	set_intercept(svm, INTERCEPT_SKINIT);
> >  	set_intercept(svm, INTERCEPT_WBINVD);
> > -	set_intercept(svm, INTERCEPT_MONITOR);
> > -	set_intercept(svm, INTERCEPT_MWAIT);
> 
> Why do you remove the intercepts for AMD? The new kvm_mwait_in_guest()
> function will always return false on AMD anyway,

I think that's a bug and I should fix it to return true there.

> and on Intel you re-add
> the intercepts for !kvm_mwait_in_guest().
> 
> 
> 	Joerg

Does AMD need some work-around similar to CPUID5_ECX_INTERRUPT_BREAK?
That's why we have kvm_mwait_in_guest ...
Alexander Graf March 27, 2017, 1:34 p.m. UTC | #5
On 15/03/2017 22:22, Michael S. Tsirkin wrote:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> unless explicitly provided with kernel command line argument
> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> without checking CPUID.
>
> We currently emulate that as a NOP but on VMX we can do better: let
> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> but that isn't any worse than a NOP emulation.
>
> Note that mwait within guests is not the same as on real hardware
> because halt causes an exit while mwait doesn't.  For this reason it
> might not be a good idea to use the regular MWAIT flag in CPUID to
> signal this capability.  Add a flag in the hypervisor leaf instead.

So imagine we had proper MWAIT emulation capabilities based on page 
faults. In that case, we could do something as fancy as

Treat MWAIT as pass-through by default

Have a per-vcpu monitor timer 10 times a second in the background that 
checks which instruction we're in

If we're in mwait for the last - say - 1 second, switch to emulated 
MWAIT, if $IP was in non-mwait within that time, reset counter.

Or instead maybe just reuse the adapter hlt logic?

Either way, with that we should be able to get super low latency IPIs 
running while still maintaining some sanity on systems which don't have 
dedicated CPUs for workloads.

And we wouldn't need guest modifications, which is a great plus. So 
older guests (and Windows?) could benefit from mwait as well.


Alex
Radim Krčmář March 28, 2017, 2:28 p.m. UTC | #6
2017-03-27 15:34+0200, Alexander Graf:
> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> unless explicitly provided with kernel command line argument
>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> without checking CPUID.
>> 
>> We currently emulate that as a NOP but on VMX we can do better: let
>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>> but that isn't any worse than a NOP emulation.
>> 
>> Note that mwait within guests is not the same as on real hardware
>> because halt causes an exit while mwait doesn't.  For this reason it
>> might not be a good idea to use the regular MWAIT flag in CPUID to
>> signal this capability.  Add a flag in the hypervisor leaf instead.
> 
> So imagine we had proper MWAIT emulation capabilities based on page faults.
> In that case, we could do something as fancy as
> 
> Treat MWAIT as pass-through by default
> 
> Have a per-vcpu monitor timer 10 times a second in the background that
> checks which instruction we're in
> 
> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
> if $IP was in non-mwait within that time, reset counter.

Or we could reuse external interrupts for sampling.  Exits trigerred by
them would check for current instruction (probably would be best to
limit just to timer tick) and a sufficient ratio (> 0?) of other exits
would imply that MWAIT is not used.

> Or instead maybe just reuse the adapter hlt logic?

Emulated MWAIT is very similar to emulated HLT, so reusing the logic
makes sense.  We would just add new wakeup methods.

> Either way, with that we should be able to get super low latency IPIs
> running while still maintaining some sanity on systems which don't have
> dedicated CPUs for workloads.
> 
> And we wouldn't need guest modifications, which is a great plus. So older
> guests (and Windows?) could benefit from mwait as well.

There is no need guest modifications -- it could be exposed as standard
MWAIT feature to the guest, with responsibilities for guest/host-impact
on the user.

I think that the page-fault based MWAIT would require paravirt if it
should be enabled by default, because of performance concerns:
Enabling write protection on a page needs a VM exit on all other VCPUs
when beginning monitoring (to reload page permissions and prevent missed
writes).
We'd want to keep trapping writes to the page all the time because
toggling is slow, but this could regress performance for an OS that has
other data accessed by other VCPUs in that page.
No current interface can tell the guest that it should reserve the whole
page instead of what CPUID[5] says and that writes to the monitored page
are not "cheap", but can trigger a VM exit ...

And before we disable MWAIT exiting by default, we also have to
understand the old OS X on core 2 bug from Gabriel.
Jim Mattson March 28, 2017, 8:35 p.m. UTC | #7
On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2017-03-27 15:34+0200, Alexander Graf:
>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>> unless explicitly provided with kernel command line argument
>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>> without checking CPUID.
>>>
>>> We currently emulate that as a NOP but on VMX we can do better: let
>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>> but that isn't any worse than a NOP emulation.
>>>
>>> Note that mwait within guests is not the same as on real hardware
>>> because halt causes an exit while mwait doesn't.  For this reason it
>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>
>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>> In that case, we could do something as fancy as
>>
>> Treat MWAIT as pass-through by default
>>
>> Have a per-vcpu monitor timer 10 times a second in the background that
>> checks which instruction we're in
>>
>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>> if $IP was in non-mwait within that time, reset counter.
>
> Or we could reuse external interrupts for sampling.  Exits trigerred by
> them would check for current instruction (probably would be best to
> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
> would imply that MWAIT is not used.
>
>> Or instead maybe just reuse the adapter hlt logic?
>
> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
> makes sense.  We would just add new wakeup methods.
>
>> Either way, with that we should be able to get super low latency IPIs
>> running while still maintaining some sanity on systems which don't have
>> dedicated CPUs for workloads.
>>
>> And we wouldn't need guest modifications, which is a great plus. So older
>> guests (and Windows?) could benefit from mwait as well.
>
> There is no need guest modifications -- it could be exposed as standard
> MWAIT feature to the guest, with responsibilities for guest/host-impact
> on the user.
>
> I think that the page-fault based MWAIT would require paravirt if it
> should be enabled by default, because of performance concerns:
> Enabling write protection on a page needs a VM exit on all other VCPUs
> when beginning monitoring (to reload page permissions and prevent missed
> writes).
> We'd want to keep trapping writes to the page all the time because
> toggling is slow, but this could regress performance for an OS that has
> other data accessed by other VCPUs in that page.
> No current interface can tell the guest that it should reserve the whole
> page instead of what CPUID[5] says and that writes to the monitored page
> are not "cheap", but can trigger a VM exit ...

CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
when running Mac OS X guests. Per Intel's SDM volume 3, section
8.10.5, "To avoid false wake-ups; use the largest monitor line size to
pad the data structure used to monitor writes. Software must make sure
that beyond the data structure, no unrelated data variable exists in
the triggering area for MWAIT. A pad may be needed to avoid this
situation." Unfortunately, most operating systems do not follow this
advice.

>
> And before we disable MWAIT exiting by default, we also have to
> understand the old OS X on core 2 bug from Gabriel.
Radim Krčmář March 29, 2017, 12:11 p.m. UTC | #8
2017-03-28 13:35-0700, Jim Mattson:
> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> 2017-03-27 15:34+0200, Alexander Graf:
>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>> unless explicitly provided with kernel command line argument
>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>> without checking CPUID.
>>>>
>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>> but that isn't any worse than a NOP emulation.
>>>>
>>>> Note that mwait within guests is not the same as on real hardware
>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>
>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>> In that case, we could do something as fancy as
>>>
>>> Treat MWAIT as pass-through by default
>>>
>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>> checks which instruction we're in
>>>
>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>> if $IP was in non-mwait within that time, reset counter.
>>
>> Or we could reuse external interrupts for sampling.  Exits trigerred by
>> them would check for current instruction (probably would be best to
>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> would imply that MWAIT is not used.
>>
>>> Or instead maybe just reuse the adapter hlt logic?
>>
>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> makes sense.  We would just add new wakeup methods.
>>
>>> Either way, with that we should be able to get super low latency IPIs
>>> running while still maintaining some sanity on systems which don't have
>>> dedicated CPUs for workloads.
>>>
>>> And we wouldn't need guest modifications, which is a great plus. So older
>>> guests (and Windows?) could benefit from mwait as well.
>>
>> There is no need guest modifications -- it could be exposed as standard
>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>> on the user.
>>
>> I think that the page-fault based MWAIT would require paravirt if it
>> should be enabled by default, because of performance concerns:
>> Enabling write protection on a page needs a VM exit on all other VCPUs
>> when beginning monitoring (to reload page permissions and prevent missed
>> writes).
>> We'd want to keep trapping writes to the page all the time because
>> toggling is slow, but this could regress performance for an OS that has
>> other data accessed by other VCPUs in that page.
>> No current interface can tell the guest that it should reserve the whole
>> page instead of what CPUID[5] says and that writes to the monitored page
>> are not "cheap", but can trigger a VM exit ...
> 
> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
> when running Mac OS X guests. Per Intel's SDM volume 3, section
> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
> pad the data structure used to monitor writes. Software must make sure
> that beyond the data structure, no unrelated data variable exists in
> the triggering area for MWAIT. A pad may be needed to avoid this
> situation." Unfortunately, most operating systems do not follow this
> advice.

Right, EBX provides what we need to expose that the whole page is
monitored, thanks!

>             Unfortunately, most operating systems do not follow this
> advice.

Yeah ... KVM could add yet another heuristic to drop MWAIT emulation and
use hardware if there were many traps while the target was not MWAITING,
it's getting over-complicated, though :/
Alexander Graf April 3, 2017, 10:04 a.m. UTC | #9
On 03/29/2017 02:11 PM, Radim Krčmář wrote:
> 2017-03-28 13:35-0700, Jim Mattson:
>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>> unless explicitly provided with kernel command line argument
>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>> without checking CPUID.
>>>>>
>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>>> but that isn't any worse than a NOP emulation.
>>>>>
>>>>> Note that mwait within guests is not the same as on real hardware
>>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>> In that case, we could do something as fancy as
>>>>
>>>> Treat MWAIT as pass-through by default
>>>>
>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>> checks which instruction we're in
>>>>
>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>> if $IP was in non-mwait within that time, reset counter.
>>> Or we could reuse external interrupts for sampling.  Exits trigerred by
>>> them would check for current instruction (probably would be best to
>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>> would imply that MWAIT is not used.
>>>
>>>> Or instead maybe just reuse the adapter hlt logic?
>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>> makes sense.  We would just add new wakeup methods.
>>>
>>>> Either way, with that we should be able to get super low latency IPIs
>>>> running while still maintaining some sanity on systems which don't have
>>>> dedicated CPUs for workloads.
>>>>
>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>> guests (and Windows?) could benefit from mwait as well.
>>> There is no need guest modifications -- it could be exposed as standard
>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>> on the user.
>>>
>>> I think that the page-fault based MWAIT would require paravirt if it
>>> should be enabled by default, because of performance concerns:
>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>> when beginning monitoring (to reload page permissions and prevent missed
>>> writes).
>>> We'd want to keep trapping writes to the page all the time because
>>> toggling is slow, but this could regress performance for an OS that has
>>> other data accessed by other VCPUs in that page.
>>> No current interface can tell the guest that it should reserve the whole
>>> page instead of what CPUID[5] says and that writes to the monitored page
>>> are not "cheap", but can trigger a VM exit ...
>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> pad the data structure used to monitor writes. Software must make sure
>> that beyond the data structure, no unrelated data variable exists in
>> the triggering area for MWAIT. A pad may be needed to avoid this
>> situation." Unfortunately, most operating systems do not follow this
>> advice.
> Right, EBX provides what we need to expose that the whole page is
> monitored, thanks!

So coming back to the original patch, is there anything that should keep 
us from exposing MWAIT straight into the guest at all times?


Alex
Radim Krčmář April 4, 2017, 12:39 p.m. UTC | #10
2017-04-03 12:04+0200, Alexander Graf:
> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>> 2017-03-28 13:35-0700, Jim Mattson:
>> > On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> > > 2017-03-27 15:34+0200, Alexander Graf:
>> > > > On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> > > > > unless explicitly provided with kernel command line argument
>> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> > > > > without checking CPUID.
>> > > > > 
>> > > > > We currently emulate that as a NOP but on VMX we can do better: let
>> > > > > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>> > > > > but that isn't any worse than a NOP emulation.
>> > > > > 
>> > > > > Note that mwait within guests is not the same as on real hardware
>> > > > > because halt causes an exit while mwait doesn't.  For this reason it
>> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
>> > > > > signal this capability.  Add a flag in the hypervisor leaf instead.
>> > > > So imagine we had proper MWAIT emulation capabilities based on page faults.
>> > > > In that case, we could do something as fancy as
>> > > > 
>> > > > Treat MWAIT as pass-through by default
>> > > > 
>> > > > Have a per-vcpu monitor timer 10 times a second in the background that
>> > > > checks which instruction we're in
>> > > > 
>> > > > If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>> > > > if $IP was in non-mwait within that time, reset counter.
>> > > Or we could reuse external interrupts for sampling.  Exits trigerred by
>> > > them would check for current instruction (probably would be best to
>> > > limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> > > would imply that MWAIT is not used.
>> > > 
>> > > > Or instead maybe just reuse the adapter hlt logic?
>> > > Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> > > makes sense.  We would just add new wakeup methods.
>> > > 
>> > > > Either way, with that we should be able to get super low latency IPIs
>> > > > running while still maintaining some sanity on systems which don't have
>> > > > dedicated CPUs for workloads.
>> > > > 
>> > > > And we wouldn't need guest modifications, which is a great plus. So older
>> > > > guests (and Windows?) could benefit from mwait as well.
>> > > There is no need guest modifications -- it could be exposed as standard
>> > > MWAIT feature to the guest, with responsibilities for guest/host-impact
>> > > on the user.
>> > > 
>> > > I think that the page-fault based MWAIT would require paravirt if it
>> > > should be enabled by default, because of performance concerns:
>> > > Enabling write protection on a page needs a VM exit on all other VCPUs
>> > > when beginning monitoring (to reload page permissions and prevent missed
>> > > writes).
>> > > We'd want to keep trapping writes to the page all the time because
>> > > toggling is slow, but this could regress performance for an OS that has
>> > > other data accessed by other VCPUs in that page.
>> > > No current interface can tell the guest that it should reserve the whole
>> > > page instead of what CPUID[5] says and that writes to the monitored page
>> > > are not "cheap", but can trigger a VM exit ...
>> > CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> > VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> > when running Mac OS X guests. Per Intel's SDM volume 3, section
>> > 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> > pad the data structure used to monitor writes. Software must make sure
>> > that beyond the data structure, no unrelated data variable exists in
>> > the triggering area for MWAIT. A pad may be needed to avoid this
>> > situation." Unfortunately, most operating systems do not follow this
>> > advice.
>> Right, EBX provides what we need to expose that the whole page is
>> monitored, thanks!
> 
> So coming back to the original patch, is there anything that should keep us
> from exposing MWAIT straight into the guest at all times?

Just minor issues:
 * OS X on Core 2 fails for unknown reason if we disable the instruction
   trapping, which is an argument against doing it by default
 * idling guests would consume host CPU, which is a significant change
   in behavior and shouldn't be done without userspace's involvement

I think the best compromise is to add a capability for the MWAIT VM-exit
controls and let userspace expose MWAIT if it wishes to.
Will send a patch.
Alexander Graf April 4, 2017, 12:51 p.m. UTC | #11
On 04/04/2017 02:39 PM, Radim Krčmář wrote:
> 2017-04-03 12:04+0200, Alexander Graf:
>> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>>> 2017-03-28 13:35-0700, Jim Mattson:
>>>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>>>> unless explicitly provided with kernel command line argument
>>>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>>>> without checking CPUID.
>>>>>>>
>>>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>>>>> but that isn't any worse than a NOP emulation.
>>>>>>>
>>>>>>> Note that mwait within guests is not the same as on real hardware
>>>>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>>>> In that case, we could do something as fancy as
>>>>>>
>>>>>> Treat MWAIT as pass-through by default
>>>>>>
>>>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>>>> checks which instruction we're in
>>>>>>
>>>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>>>> if $IP was in non-mwait within that time, reset counter.
>>>>> Or we could reuse external interrupts for sampling.  Exits trigerred by
>>>>> them would check for current instruction (probably would be best to
>>>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>>>> would imply that MWAIT is not used.
>>>>>
>>>>>> Or instead maybe just reuse the adapter hlt logic?
>>>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>>>> makes sense.  We would just add new wakeup methods.
>>>>>
>>>>>> Either way, with that we should be able to get super low latency IPIs
>>>>>> running while still maintaining some sanity on systems which don't have
>>>>>> dedicated CPUs for workloads.
>>>>>>
>>>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>>>> guests (and Windows?) could benefit from mwait as well.
>>>>> There is no need guest modifications -- it could be exposed as standard
>>>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>>>> on the user.
>>>>>
>>>>> I think that the page-fault based MWAIT would require paravirt if it
>>>>> should be enabled by default, because of performance concerns:
>>>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>>>> when beginning monitoring (to reload page permissions and prevent missed
>>>>> writes).
>>>>> We'd want to keep trapping writes to the page all the time because
>>>>> toggling is slow, but this could regress performance for an OS that has
>>>>> other data accessed by other VCPUs in that page.
>>>>> No current interface can tell the guest that it should reserve the whole
>>>>> page instead of what CPUID[5] says and that writes to the monitored page
>>>>> are not "cheap", but can trigger a VM exit ...
>>>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>>>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>>>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>>>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>>>> pad the data structure used to monitor writes. Software must make sure
>>>> that beyond the data structure, no unrelated data variable exists in
>>>> the triggering area for MWAIT. A pad may be needed to avoid this
>>>> situation." Unfortunately, most operating systems do not follow this
>>>> advice.
>>> Right, EBX provides what we need to expose that the whole page is
>>> monitored, thanks!
>> So coming back to the original patch, is there anything that should keep us
>> from exposing MWAIT straight into the guest at all times?
> Just minor issues:
>   * OS X on Core 2 fails for unknown reason if we disable the instruction
>     trapping, which is an argument against doing it by default

So for that we should try and see if changing the exposed CPUID MWAIT 
leaf helps. Currently we return 0/0 which is pretty bogus and might be 
the reason OSX fails.

>   * idling guests would consume host CPU, which is a significant change
>     in behavior and shouldn't be done without userspace's involvement

That's the same as today, as idling guests with MWAIT would also today 
end up in a NOP emulated loop.

Please bear in mind that I do not advocate to expose the MWAIT CPUID 
flag. This is only for the instruction trap.

> I think the best compromise is to add a capability for the MWAIT VM-exit
> controls and let userspace expose MWAIT if it wishes to.
> Will send a patch.


Please see my patch to force enable CPUID bits ;).



Alex
Radim Krčmář April 4, 2017, 1:13 p.m. UTC | #12
2017-04-04 14:51+0200, Alexander Graf:
> On 04/04/2017 02:39 PM, Radim Krčmář wrote:
>> 2017-04-03 12:04+0200, Alexander Graf:
>> > So coming back to the original patch, is there anything that should keep us
>> > from exposing MWAIT straight into the guest at all times?
>> Just minor issues:
>>   * OS X on Core 2 fails for unknown reason if we disable the instruction
>>     trapping, which is an argument against doing it by default
> 
> So for that we should try and see if changing the exposed CPUID MWAIT leaf
> helps. Currently we return 0/0 which is pretty bogus and might be the reason
> OSX fails.

We have tried to pass host's CPUID MWAIT leaf and it still failed:
https://www.spinics.net/lists/kvm/msg146686.html

I wouldn't mind breaking that particular combination of OS X and
hardware, but I'm worried to do it because we don't understand why it
broke, so there could be more ...

>>   * idling guests would consume host CPU, which is a significant change
>>     in behavior and shouldn't be done without userspace's involvement
> 
> That's the same as today, as idling guests with MWAIT would also today end
> up in a NOP emulated loop.
> 
> Please bear in mind that I do not advocate to expose the MWAIT CPUID flag.
> This is only for the instruction trap.

Ah, makes sense.

>> I think the best compromise is to add a capability for the MWAIT VM-exit
>> controls and let userspace expose MWAIT if it wishes to.
>> Will send a patch.
> 
> Please see my patch to force enable CPUID bits ;).

Nice.  MWAIT could also use setting of arbitrary values for its leaf,
but a generic interface for that would probably look clunky on the
command line ...
Alexander Graf April 4, 2017, 1:15 p.m. UTC | #13
On 04/04/2017 03:13 PM, Radim Krčmář wrote:
> 2017-04-04 14:51+0200, Alexander Graf:
>> On 04/04/2017 02:39 PM, Radim Krčmář wrote:
>>> 2017-04-03 12:04+0200, Alexander Graf:
>>>> So coming back to the original patch, is there anything that should keep us
>>>> from exposing MWAIT straight into the guest at all times?
>>> Just minor issues:
>>>    * OS X on Core 2 fails for unknown reason if we disable the instruction
>>>      trapping, which is an argument against doing it by default
>> So for that we should try and see if changing the exposed CPUID MWAIT leaf
>> helps. Currently we return 0/0 which is pretty bogus and might be the reason
>> OSX fails.
> We have tried to pass host's CPUID MWAIT leaf and it still failed:
> https://www.spinics.net/lists/kvm/msg146686.html
>
> I wouldn't mind breaking that particular combination of OS X and
> hardware, but I'm worried to do it because we don't understand why it
> broke, so there could be more ...
>
>>>    * idling guests would consume host CPU, which is a significant change
>>>      in behavior and shouldn't be done without userspace's involvement
>> That's the same as today, as idling guests with MWAIT would also today end
>> up in a NOP emulated loop.
>>
>> Please bear in mind that I do not advocate to expose the MWAIT CPUID flag.
>> This is only for the instruction trap.
> Ah, makes sense.
>
>>> I think the best compromise is to add a capability for the MWAIT VM-exit
>>> controls and let userspace expose MWAIT if it wishes to.
>>> Will send a patch.
>> Please see my patch to force enable CPUID bits ;).
> Nice.  MWAIT could also use setting of arbitrary values for its leaf,
> but a generic interface for that would probably look clunky on the
> command line ...


I think we should have an interface similar to smbios for that 
eventually. Something where you can explicitly set arbitrary CPUID leaf 
information using leaf specific syntax. There are more leafs where it 
would make sense - cache topology for example.


Alex
Radim Krčmář April 4, 2017, 1:44 p.m. UTC | #14
[Cc qemu-devel as we've gone off-topic]

2017-04-04 15:15+0200, Alexander Graf:
> On 04/04/2017 03:13 PM, Radim Krčmář wrote:
>> 2017-04-04 14:51+0200, Alexander Graf:
>> > Please see my patch to force enable CPUID bits ;).
>> Nice.  MWAIT could also use setting of arbitrary values for its leaf,
>> but a generic interface for that would probably look clunky on the
>> command line ...
> 
> 
> I think we should have an interface similar to smbios for that eventually.
> Something where you can explicitly set arbitrary CPUID leaf information
> using leaf specific syntax. There are more leafs where it would make sense -
> cache topology for example.

Right, separating cpuid from -cpu makes it bearable, like

  -cpuid leaf=%x[,subleaf=%x][,eax=%x][,ebx=%x][,ecx=%x][,edx=%x]

And Having multiple interfaces for the same thing would result in some
corner case decisions ...
I think QEMU should check that feature flags specified flags specified
by -cpu are not cleared by -cpuid.
I'm not sure if setters like "|=" and "&=~" would be beneficial in some
cases.

Patch
diff mbox

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c248f7..6ee2e43 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4147,3 +4147,12 @@  This capability, if KVM_CHECK_EXTENSION indicates that it is
 available, means that that the kernel can support guests using the
 hashed page table MMU defined in Power ISA V3.00 (as implemented in
 the POWER9 processor), including in-memory segment tables.
+
+8.5 KVM_CAP_X86_GUEST_MWAIT
+
+Architectures: x86
+
+This capability indicates that guest using memory monotoring instructions
+(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
+spent while virtual CPU is halted in this way will then be accounted for as
+guest running time on the host (as opposed to e.g. HLT).
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 3c65feb..04c201c 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -54,6 +54,12 @@  KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
                                    ||       || before enabling paravirtualized
                                    ||       || spinlock support.
 ------------------------------------------------------------------------------
+KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
+                                   ||       || to halt the VCPU without exits,
+                                   ||       || time spent while halted in this
+                                   ||       || way is accounted for on host as
+                                   ||       || VCPU run time.
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..9cc77a7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -24,6 +24,7 @@ 
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_EOI		6
 #define KVM_FEATURE_PV_UNHALT		7
+#define KVM_FEATURE_MWAIT		8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc..5638102 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -594,6 +594,9 @@  static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
+		if (kvm_mwait_in_guest())
+			entry->eax |= (1 << KVM_FEATURE_MWAIT);
+
 		entry->ebx = 0;
 		entry->ecx = 0;
 		entry->edx = 0;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d1efe2c..18e53bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1198,8 +1198,6 @@  static void init_vmcb(struct vcpu_svm *svm)
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
 
 	control->iopm_base_pa = iopm_base;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 98e82ee..ea0c96a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3547,11 +3547,13 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
 
+	if (!kvm_mwait_in_guest())
+		min |= CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1faf620..8c74fff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2684,6 +2684,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ADJUST_CLOCK:
 		r = KVM_CLOCK_TSC_STABLE;
 		break;
+	case KVM_CAP_X86_GUEST_MWAIT:
+		r = kvm_mwait_in_guest();
+		break;
 	case KVM_CAP_X86_SMM:
 		/* SMBASE is usually relocated above 1M on modern chipsets,
 		 * and SMM handlers might indeed rely on 4G segment limits,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e8ff3e4..a2d8964 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -1,6 +1,8 @@ 
 #ifndef ARCH_X86_KVM_X86_H
 #define ARCH_X86_KVM_X86_H
 
+#include <asm/processor.h>
+#include <asm/mwait.h>
 #include <linux/kvm_host.h>
 #include <asm/pvclock.h>
 #include "kvm_cache_regs.h"
@@ -212,4 +214,30 @@  static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 	    __rem;						\
 	 })
 
+static inline bool kvm_mwait_in_guest(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+		return false;
+
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return false;
+
+	/*
+	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
+	 * they would allow guest to stop the CPU completely by disabling
+	 * interrupts then invoking MWAIT.
+	 */
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return false;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return false;
+
+	return true;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d508..8b6bc06 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_X86_GUEST_MWAIT 137
 
 #ifdef KVM_CAP_IRQ_ROUTING