
[v3,7/7] x86/kvm: use Enlightened VMCS when running on Hyper-V

Message ID 20180309140249.2840-8-vkuznets@redhat.com (mailing list archive)
State New, archived

Commit Message

Vitaly Kuznetsov March 9, 2018, 2:02 p.m. UTC
Enlightened VMCS is just a structure in memory, the main benefit
besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
mask: we tell the underlying hypervisor which fields were modified
since VMEXIT so there's no need to inspect them all.

Tight CPUID loop test shows significant speedup:
Before: 18890 cycles
After: 8304 cycles

Static key is being used to avoid performance penalty for non-Hyper-V
deployments. Tests show we add around 3 (three) CPU cycles on each
VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
but I don't see a clean way to use static key in assembly.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
Changes since v2:
- define KVM_EVMCS_VERSION [Radim Krčmář]
- WARN_ONCE in get_evmcs_offset[,_cf] [Radim Krčmář]
- add evmcs_sanitize_exec_ctrls() and use it in hardware_setup() and
  dump_vmcs() [Radim Krčmář]
---
 arch/x86/kvm/vmx.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 615 insertions(+), 10 deletions(-)
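
To make the clean-field mechanism concrete: an eVMCS "write" is a plain store into the in-memory structure, followed by clearing the corresponding clean bit so Hyper-V knows to re-read that group of fields on the next entry. A sketch consistent with the get_evmcs_offset_cf() helper quoted in the review below (the actual evmcs_write64() hunk is elided from this page):

static void evmcs_write64(unsigned long field, u64 value)
{
	u16 clean_field;
	u16 offset = get_evmcs_offset_cf(field, &clean_field);

	if (!offset)
		return;

	/* plain memory store instead of VMWRITE */
	*(u64 *)((char *)current_evmcs + offset) = value;

	/* mark the field's group dirty so the hypervisor reloads it */
	current_evmcs->hv_clean_fields &= ~clean_field;
}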

Comments

Thomas Gleixner March 9, 2018, 2:08 p.m. UTC | #1
On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:

> Enlightened VMCS is just a structure in memory, the main benefit
> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
> mask: we tell the underlying hypervisor which fields were modified
> since VMEXIT so there's no need to inspect them all.
> 
> Tight CPUID loop test shows significant speedup:
> Before: 18890 cycles
> After: 8304 cycles
> 
> Static key is being used to avoid performance penalty for non-Hyper-V
> deployments. Tests show we add around 3 (three) CPU cycles on each
> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> but I don't see a clean way to use static key in assembly.

STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.

Thanks,

	tglx
Vitaly Kuznetsov March 12, 2018, 2:19 p.m. UTC | #2
Thomas Gleixner <tglx@linutronix.de> writes:

> On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:
>
>> Enlightened VMCS is just a structure in memory, the main benefit
>> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
>> mask: we tell the underlying hypervisor which fields were modified
>> since VMEXIT so there's no need to inspect them all.
>> 
>> Tight CPUID loop test shows significant speedup:
>> Before: 18890 cycles
>> After: 8304 cycles
>> 
>> Static key is being used to avoid performance penalty for non-Hyper-V
>> deployments. Tests show we add around 3 (three) CPU cycles on each
>> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
>> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
>> but I don't see a clean way to use static key in assembly.
>
> STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
>

Thanks for the tip,

with a single kernel user of these APIs it was easy to miss :-)

Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
(afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
nice if we can make them behave the same way static_branch_likely() and
friends do: compile into something else when !HAVE_JUMP_LABEL so we can
avoid nasty #ifdefs in C code.

That said I'd like to defer the question to KVM maintainers: Paolo,
Radim, what would you like me to do? Use STATIC_JUMP_IF_TRUE/FALSE as
they are, try to make them work for !HAVE_JUMP_LABEL and use them or
maybe we can commit the series as-is and have it as a future
optimization (e.g. when HAVE_JUMP_LABEL becomes mandatory)?
Radim Krčmář March 13, 2018, 7:12 p.m. UTC | #3
2018-03-12 15:19+0100, Vitaly Kuznetsov:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:
> >
> >> Enlightened VMCS is just a structure in memory, the main benefit
> >> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
> >> mask: we tell the underlying hypervisor which fields were modified
> >> since VMEXIT so there's no need to inspect them all.
> >> 
> >> Tight CPUID loop test shows significant speedup:
> >> Before: 18890 cycles
> >> After: 8304 cycles
> >> 
> >> Static key is being used to avoid performance penalty for non-Hyper-V
> >> deployments. Tests show we add around 3 (three) CPU cycles on each
> >> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> >> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> >> but I don't see a clean way to use static key in assembly.
> >
> > STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
> >
> 
> Thanks for the tip,
> 
> with a single kernel user of these APIs it was easy to miss :-)

Indeed, I had no idea.

> Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
> (afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
> nice if we can make them behave the same way static_branch_likely() and
> friends do: compile into something else when !HAVE_JUMP_LABEL so we can
> avoid nasty #ifdefs in C code.
> 
> That said I'd like to defer the question to KVM maintainers: Paolo,
> Radim, what would you like me to do? Use STATIC_JUMP_IF_TRUE/FALSE as
> they are, try to make them work for !HAVE_JUMP_LABEL and use them or
> maybe we can commit the series as-is and have it as a future
> optimization (e.g. when HAVE_JUMP_LABEL becomes mandatory)?

Please take a look into making a macro that uses STATIC_JUMP_IF_FALSE or
reads the value from provided static_key and does a test-jump, depending
on HAVE_JUMP_LABEL.
It doesn't need to be suited for general use, just something that moves
the ugliness away from vmx_vcpu_run.
(Although having it in jump_label.h would be great. I think the main
 obstacle is clobbering of flags.)

If it were still looking horrible, I'm ok with the series as-is,

thanks.
Paolo Bonzini March 14, 2018, 2:53 p.m. UTC | #4
On 09/03/2018 15:02, Vitaly Kuznetsov wrote:
> Enlightened VMCS is just a structure in memory, the main benefit
> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
> mask: we tell the underlying hypervisor which fields were modified
> since VMEXIT so there's no need to inspect them all.
> 
> Tight CPUID loop test shows significant speedup:
> Before: 18890 cycles
> After: 8304 cycles
> 
> Static key is being used to avoid performance penalty for non-Hyper-V
> deployments. Tests show we add around 3 (three) CPU cycles on each
> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> but I don't see a clean way to use static key in assembly.

If you want to live dangerously, you can use text_poke_early to change
the vmwrite to mov.  It's just a single instruction, so it's probably
not too hard.

> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
> Changes since v2:
> - define KVM_EVMCS_VERSION [Radim Krčmář]
> - WARN_ONCE in get_evmcs_offset[,_cf] [Radim Krčmář]
> - add evmcs_sanitize_exec_ctrls() and use it in hardware_setup() and
>   dump_vmcs() [Radim Krčmář]
> ---
>  arch/x86/kvm/vmx.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 615 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 051dab74e4e9..44b6efa7d54e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -53,6 +53,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/microcode.h>
>  #include <asm/nospec-branch.h>
> +#include <asm/mshyperv.h>
>  
>  #include "trace.h"
>  #include "pmu.h"
> @@ -1000,6 +1001,484 @@ static const u32 vmx_msr_index[] = {
>  	MSR_EFER, MSR_TSC_AUX, MSR_STAR,
>  };
>  
> +DEFINE_STATIC_KEY_FALSE(enable_evmcs);
> +
> +#define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs))
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +static bool __read_mostly enlightened_vmcs = true;
> +module_param(enlightened_vmcs, bool, 0444);
> +
> +#define KVM_EVMCS_VERSION 1
> +
> +#define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
> +#define EVMCS1_FIELD(number, name, clean_mask)[ROL16(number, 6)] = \
> +		(u32)EVMCS1_OFFSET(name) | ((u32)clean_mask << 16)
> +
> +/*
> + * Lower 16 bits encode offset of the field in struct hv_enlightened_vmcs,
> + * upper 16 bits hold clean field mask.
> + */
> +static const u32 vmcs_field_to_evmcs_1[] = {
> +	/* 64 bit rw */
> +	EVMCS1_FIELD(GUEST_RIP, guest_rip,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),

Maybe we should use a single "#include"d file (like vmx_shadow_fields.h)
and share it between HV-on-KVM and KVM-on-HV.

...

> +	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(CR3_TARGET_VALUE0, cr3_target_value0,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(CR3_TARGET_VALUE1, cr3_target_value1,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(CR3_TARGET_VALUE2, cr3_target_value2,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
> +	EVMCS1_FIELD(CR3_TARGET_VALUE3, cr3_target_value3,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),

We shouldn't use these on Hyper-V, should we (that is, shouldn't the
WARN below fire if you try---and so why include them at all)?

> +
> +static inline u16 get_evmcs_offset(unsigned long field)
> +{
> +	unsigned int index = ROL16(field, 6);
> +
> +	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
> +		WARN_ONCE(1, "kvm: reading unsupported EVMCS field %lx\n",
> +			  field);
> +		return 0;
> +	}
> +
> +	return (u16)vmcs_field_to_evmcs_1[index];
> +}
> +
> +static inline u16 get_evmcs_offset_cf(unsigned long field, u16 *clean_field)
> +{
> +	unsigned int index = ROL16(field, 6);
> +	u32 evmcs_field;
> +
> +	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
> +		WARN_ONCE(1, "kvm: writing unsupported EVMCS field %lx\n",
> +			  field);
> +		return 0;
> +	}
> +
> +	evmcs_field = vmcs_field_to_evmcs_1[index];
> +
> +	*clean_field = evmcs_field >> 16;
> +
> +	return (u16)evmcs_field;
> +}

You can mark this __always_inline, and make it

	if (clean_field)
		*clean_field = evmcs_field >> 16;

or alternatively, use a two-element struct and do

	evmcs_field = &vmcs_field_to_evmcs_1[index];
	if (clean_field)
		*clean_field = evmcs_field->clean_field;
	return evmcs_field->offset;

Also, if you return int and make the WARN_ONCE case return -ENOENT, GCC
should be able to optimize out the "if (!offset)" (which becomes "if
(offset < 0)") in the callers.  Nitpicking, but...

> +static void vmcs_load_enlightened(u64 phys_addr)
> +{
> +	struct hv_vp_assist_page *vp_ap =
> +		hv_get_vp_assist_page(smp_processor_id());
> +
> +	vp_ap->current_nested_vmcs = phys_addr;
> +	vp_ap->enlighten_vmentry = 1;
> +}

evmcs_load?

> +static void evmcs_sanitize_exec_ctrls(u32 *cpu_based_2nd_exec_ctrl,
> +				      u32 *pin_based_exec_ctrl)
> +{
> +	*pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_VMFUNC;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_SHADOW_VMCS;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_TSC_SCALING;
> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> +	*pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;

How can these be set?

> @@ -3596,6 +4104,14 @@ static int hardware_enable(void)
>  	if (cr4_read_shadow() & X86_CR4_VMXE)
>  		return -EBUSY;
>  
> +	/*
> +	 * This can happen if we hot-added a CPU but failed to allocate
> +	 * VP assist page for it.
> +	 */
> +	if (static_branch_unlikely(&enable_evmcs) &&
> +	    !hv_get_vp_assist_page(cpu))
> +		return -EFAULT;

-ENODEV?  Maybe add a printk, because this is really rare.
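
Something like (sketch; the message text is illustrative):

	/*
	 * This can happen if we hot-added a CPU but failed to allocate
	 * the VP assist page for it.
	 */
	if (static_branch_unlikely(&enable_evmcs) &&
	    !hv_get_vp_assist_page(cpu)) {
		pr_err("kvm: VP assist page not allocated for CPU%d\n", cpu);
		return -ENODEV;
	}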

>  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
>  	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
>  	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> @@ -3829,7 +4345,12 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  	vmcs_conf->size = vmx_msr_high & 0x1fff;
>  	vmcs_conf->order = get_order(vmcs_conf->size);
>  	vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
> -	vmcs_conf->revision_id = vmx_msr_low;
> +
> +	/* KVM supports Enlightened VMCS v1 only */
> +	if (static_branch_unlikely(&enable_evmcs))
> +		vmcs_conf->revision_id = KVM_EVMCS_VERSION;
> +	else
> +		vmcs_conf->revision_id = vmx_msr_low;
>  
>  	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
>  	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
> @@ -6990,6 +7511,17 @@ static __init int hardware_setup(void)
>  		goto out;
>  	}
>  
> +	if (static_branch_unlikely(&enable_evmcs)) {
> +		evmcs_sanitize_exec_ctrls(&vmcs_config.cpu_based_2nd_exec_ctrl,
> +					  &vmcs_config.pin_based_exec_ctrl);

Why not do it in setup_vmcs_config after the vmcs_conf->vmentry_ctrl
assignment (and pass &vmcs_config, which there is "vmcs_conf", directly
to the function)?  And if sanitizing clears the bits in vmentry_ctl and
vmexit_ctl, there's no need to clear cpu_has_load_perf_global_ctrl.
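
That is, roughly (a sketch of the suggested shape, with the helper taking the whole vmcs_config):

static void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
{
	vmcs_conf->pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
	vmcs_conf->pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_VMFUNC;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_SHADOW_VMCS;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_TSC_SCALING;
	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;

	/*
	 * Clearing the entry/exit bits here also makes the
	 * cpu_has_load_perf_global_ctrl juggling unnecessary.
	 */
	vmcs_conf->vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
	vmcs_conf->vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
}

and then, at the end of setup_vmcs_config():

	if (static_branch_unlikely(&enable_evmcs))
		evmcs_sanitize_exec_ctrls(vmcs_conf);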

> +		/*
> +		 * Enlightened VMCSv1 doesn't support these:
> +		 *	GUEST_IA32_PERF_GLOBAL_CTRL	= 0x00002808,
> +		 *	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
> +		 */
> +		cpu_has_load_perf_global_ctrl = false;
> +	}
> +
>  	if (boot_cpu_has(X86_FEATURE_NX))
>  		kvm_enable_efer_bits(EFER_NX);
>  
> @@ -8745,6 +9277,10 @@ static void dump_vmcs(void)
>  	if (cpu_has_secondary_exec_ctrls())
>  		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>  
> +	if (static_branch_unlikely(&enable_evmcs))
> +		evmcs_sanitize_exec_ctrls(&secondary_exec_control,
> +					  &pin_based_exec_ctrl);

This is wrong, we're reading the VMCS so the values must already be
sanitized (and if not, that's the bug and we want dump_vmcs to print the
"wrong" values).

>  	pr_err("*** Guest State ***\n");
>  	pr_err("CR0: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
>  	       vmcs_readl(GUEST_CR0), vmcs_readl(CR0_READ_SHADOW),
> @@ -8784,7 +9320,8 @@ static void dump_vmcs(void)
>  	pr_err("DebugCtl = 0x%016llx  DebugExceptions = 0x%016lx\n",
>  	       vmcs_read64(GUEST_IA32_DEBUGCTL),
>  	       vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS));
> -	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
> +	if (cpu_has_load_perf_global_ctrl &&
> +	    vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
>  		pr_err("PerfGlobCtl = 0x%016llx\n",
>  		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
>  	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
> @@ -8820,7 +9357,8 @@ static void dump_vmcs(void)
>  		pr_err("EFER = 0x%016llx  PAT = 0x%016llx\n",
>  		       vmcs_read64(HOST_IA32_EFER),
>  		       vmcs_read64(HOST_IA32_PAT));
> -	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
> +	if (cpu_has_load_perf_global_ctrl &&
> +	    vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
>  		pr_err("PerfGlobCtl = 0x%016llx\n",
>  		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
>  
> @@ -9397,7 +9935,7 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
>  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	unsigned long cr3, cr4;
> +	unsigned long cr3, cr4, evmcs_rsp;
>  
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!enable_vnmi &&
> @@ -9463,6 +10001,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  		native_wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
>  
>  	vmx->__launched = vmx->loaded_vmcs->launched;
> +
> +	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
> +		(unsigned long)&current_evmcs->host_rsp : 0;

(If you use text_poke_early, you can do this assignment unconditionally,
since it's just a single lea instruction).

> @@ -9604,6 +10152,11 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	/* Eliminate branch target predictions from guest mode */
>  	vmexit_fill_RSB();
>  
> +	/* All fields are clean at this point */
> +	if (static_branch_unlikely(&enable_evmcs))
> +		current_evmcs->hv_clean_fields |=
> +			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> +
>  	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
>  	if (vmx->host_debugctlmsr)
>  		update_debugctlmsr(vmx->host_debugctlmsr);
> @@ -12419,7 +12972,36 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  
>  static int __init vmx_init(void)
>  {
> -	int r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
> +	int r;
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +	/*
> +	 * Enlightened VMCS usage should be recommended and the host needs
> +	 * to support eVMCS v1 or above. We can also disable eVMCS support
> +	 * with module parameter.
> +	 */
> +	if (enlightened_vmcs &&
> +	    ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
> +	    (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
> +	    KVM_EVMCS_VERSION) {
> +		int cpu;
> +
> +		/* Check that we have assist pages on all online CPUs */
> +		for_each_online_cpu(cpu) {
> +			if (!hv_get_vp_assist_page(cpu)) {
> +				enlightened_vmcs = false;
> +				break;
> +			}
> +		}
> +		if (enlightened_vmcs) {
> +			pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
> +			static_branch_enable(&enable_evmcs);
> +		}
> +	}

A bit nicer to clear enlightened_vmcs in the "else" branch?
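
I.e. (sketch):

	if (enlightened_vmcs &&
	    ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
	    (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
	    KVM_EVMCS_VERSION) {
		/* per-CPU assist page check + static_branch_enable() as above */
	} else {
		enlightened_vmcs = false;
	}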

That's it.  Nice work!

Paolo


> +#endif
> +
> +	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
>                       __alignof__(struct vcpu_vmx), THIS_MODULE);
>  	if (r)
>  		return r;
> @@ -12440,6 +13022,29 @@ static void __exit vmx_exit(void)
>  #endif
>  
>  	kvm_exit();
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +	if (static_branch_unlikely(&enable_evmcs)) {
> +		int cpu;
> +		struct hv_vp_assist_page *vp_ap;
> +		/*
> +		 * Reset everything to support using non-enlightened VMCS
> +		 * access later (e.g. when we reload the module with
> +		 * enlightened_vmcs=0)
> +		 */
> +		for_each_online_cpu(cpu) {
> +			vp_ap =	hv_get_vp_assist_page(cpu);
> +
> +			if (!vp_ap)
> +				continue;
> +
> +			vp_ap->current_nested_vmcs = 0;
> +			vp_ap->enlighten_vmentry = 0;
> +		}
> +
> +		static_branch_disable(&enable_evmcs);
> +	}
> +#endif
>  }
>  
>  module_init(vmx_init)
>
Paolo Bonzini March 14, 2018, 2:54 p.m. UTC | #5
On 12/03/2018 15:19, Vitaly Kuznetsov wrote:
>>> Static key is being used to avoid performance penalty for non-Hyper-V
>>> deployments. Tests show we add around 3 (three) CPU cycles on each
>>> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
>>> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
>>> but I don't see a clean way to use static key in assembly.
>> STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
>>
> Thanks for the tip,
> 
> with a single kernel user of these APIs it was easy to miss :-)
> 
> Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
> (afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
> nice if we can make them behave the same way static_branch_likely() and
> friends do: compile into something else when !HAVE_JUMP_LABEL so we can
> avoid nasty #ifdefs in C code.
> 
> That said I'd like to defer the question to KVM maintainers: Paolo,
> Radim, what would you like me to do? Use STATIC_JUMP_IF_TRUE/FALSE as
> they are, try to make them work for !HAVE_JUMP_LABEL and use them or
> maybe we can commit the series as-is and have it as a future
> optimization (e.g. when HAVE_JUMP_LABEL becomes mandatory)?

With a single instruction to patch, poking at the text manually might be
an option...  Otherwise, it's okay as-is.

Paolo
Thomas Gleixner March 14, 2018, 3:19 p.m. UTC | #6
On Mon, 12 Mar 2018, Vitaly Kuznetsov wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:
> >> Static key is being used to avoid performance penalty for non-Hyper-V
> >> deployments. Tests show we add around 3 (three) CPU cycles on each
> >> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> >> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> >> but I don't see a clean way to use static key in assembly.
> >
> > STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
> >
> 
> Thanks for the tip,
> 
> with a single kernel user of these APIs it was easy to miss :-)
> 
> Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
> (afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
> nice if we can make them behave the same way static_branch_likely() and
> friends do: compile into something else when !HAVE_JUMP_LABEL so we can
> avoid nasty #ifdefs in C code.

What's the reason for !jump label builds of a recent kernel? Old compilers?

Thanks,

	tglx
Vitaly Kuznetsov March 14, 2018, 5:22 p.m. UTC | #7
Thomas Gleixner <tglx@linutronix.de> writes:

> On Mon, 12 Mar 2018, Vitaly Kuznetsov wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>> > On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:
>> >> Static key is being used to avoid performance penalty for non-Hyper-V
>> >> deployments. Tests show we add around 3 (three) CPU cycles on each
>> >> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
>> >> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
>> >> but I don't see a clean way to use static key in assembly.
>> >
>> > STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
>> >
>> 
>> Thanks for the tip,
>> 
>> with a single kernel user of these APIs it was easy to miss :-)
>> 
>> Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
>> (afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
>> nice if we can make them behave the same way static_branch_likely() and
>> friends do: compile into something else when !HAVE_JUMP_LABEL so we can
>> avoid nasty #ifdefs in C code.
>
> What's the reason for !jump label builds of a recent kernel? Old compilers?
>

To be honest I don't see any, we can start depending on HAVE_JUMP_LABEL
for CONFIG_KVM I guess.
Thomas Gleixner March 14, 2018, 7:59 p.m. UTC | #8
On Wed, 14 Mar 2018, Vitaly Kuznetsov wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > On Mon, 12 Mar 2018, Vitaly Kuznetsov wrote:
> >> Thomas Gleixner <tglx@linutronix.de> writes:
> >> > On Fri, 9 Mar 2018, Vitaly Kuznetsov wrote:
> >> >> Static key is being used to avoid performance penalty for non-Hyper-V
> >> >> deployments. Tests show we add around 3 (three) CPU cycles on each
> >> >> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> >> >> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> >> >> but I don't see a clean way to use static key in assembly.
> >> >
> >> > STATIC_JUMP_IF_TRUE, STATIC_JUMP_IF_FALSE are your friends.
> >> >
> >> 
> >> Thanks for the tip,
> >> 
> >> with a single kernel user of these APIs it was easy to miss :-)
> >> 
> >> Unfortunately, these APIs are only present if HAVE_JUMP_LABEL and
> >> (afaiu) we still care about KVM on !HAVE_JUMP_LABEL builds. It would be
> >> nice if we can make them behave the same way static_branch_likely() and
> >> friends do: compile into something else when !HAVE_JUMP_LABEL so we can
> >> avoid nasty #ifdefs in C code.
> >
> > What's the reason for !jump label builds of a recent kernel? Old compilers?
> >
> 
> To be honest I don't see any, we can start depending on HAVE_JUMP_LABEL
> for CONFIG_KVM I guess.

We currently try to move the minimum compiler version to one which provides
jump label support, so this should be a non issue.

@Peter: What was the final conclusion of this discussion?

Thanks,

	tglx
Peter Zijlstra March 14, 2018, 8:06 p.m. UTC | #9
On Wed, Mar 14, 2018 at 08:59:25PM +0100, Thomas Gleixner wrote:
> We currently try to move the minimum compiler version to one which provides
> jump label support, so this should be a non issue.
> 
> @Peter: What was the final conclusion of this discussion?

We all said we'd do it. I just haven't come around to resending you that
patch. I'll try and do so tomorrow.
Vitaly Kuznetsov March 15, 2018, 9:56 a.m. UTC | #10
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 09/03/2018 15:02, Vitaly Kuznetsov wrote:
>> Enlightened VMCS is just a structure in memory, the main benefit
>> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
>> mask: we tell the underlying hypervisor which fields were modified
>> since VMEXIT so there's no need to inspect them all.
>> 
>> Tight CPUID loop test shows significant speedup:
>> Before: 18890 cycles
>> After: 8304 cycles
>> 
>> Static key is being used to avoid performance penalty for non-Hyper-V
>> deployments. Tests show we add around 3 (three) CPU cycles on each
>> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
>> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
>> but I don't see a clean way to use static key in assembly.
>
> If you want to live dangerously, you can use text_poke_early to change
> the vmwrite to mov.  It's just a single instruction, so it's probably
> not too hard.
>

I'd say it's not worth it ...

>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> Changes since v2:
>> - define KVM_EVMCS_VERSION [Radim Krčmář]
>> - WARN_ONCE in get_evmcs_offset[,_cf] [Radim Krčmář]
>> - add evmcs_sanitize_exec_ctrls() and use it in hardware_setup() and
>>   dump_vmcs() [Radim Krčmář]
>> ---
>>  arch/x86/kvm/vmx.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 615 insertions(+), 10 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 051dab74e4e9..44b6efa7d54e 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -53,6 +53,7 @@
>>  #include <asm/mmu_context.h>
>>  #include <asm/microcode.h>
>>  #include <asm/nospec-branch.h>
>> +#include <asm/mshyperv.h>
>>  
>>  #include "trace.h"
>>  #include "pmu.h"
>> @@ -1000,6 +1001,484 @@ static const u32 vmx_msr_index[] = {
>>  	MSR_EFER, MSR_TSC_AUX, MSR_STAR,
>>  };
>>  
>> +DEFINE_STATIC_KEY_FALSE(enable_evmcs);
>> +
>> +#define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs))
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +static bool __read_mostly enlightened_vmcs = true;
>> +module_param(enlightened_vmcs, bool, 0444);
>> +
>> +#define KVM_EVMCS_VERSION 1
>> +
>> +#define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
>> +#define EVMCS1_FIELD(number, name, clean_mask)[ROL16(number, 6)] = \
>> +		(u32)EVMCS1_OFFSET(name) | ((u32)clean_mask << 16)
>> +
>> +/*
>> + * Lower 16 bits encode offset of the field in struct hv_enlightened_vmcs,
>> + * upper 16 bits hold clean field mask.
>> + */
>> +static const u32 vmcs_field_to_evmcs_1[] = {
>> +	/* 64 bit rw */
>> +	EVMCS1_FIELD(GUEST_RIP, guest_rip,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
>
> Maybe we should use a single "#include"d file (like vmx_shadow_fields.h)
> and share it between HV-on-KVM and KVM-on-HV.
>
> ...

Actually, yes, looking at 13k+ lines of code in vmx.c makes me think
it's time we start doing something about it :-)

>
>> +	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(CR3_TARGET_VALUE0, cr3_target_value0,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(CR3_TARGET_VALUE1, cr3_target_value1,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(CR3_TARGET_VALUE2, cr3_target_value2,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(CR3_TARGET_VALUE3, cr3_target_value3,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>
> We shouldn't use these on Hyper-V, should we (that is, shouldn't the
> WARN below fire if you try---and so why include them at all)?
>

True, these shouldn't be used and that's why there's no clean field
assigned to them. They, however, do have a corresponding eVMCS field.
I will try removing them in the next version.

>> +
>> +static inline u16 get_evmcs_offset(unsigned long field)
>> +{
>> +	unsigned int index = ROL16(field, 6);
>> +
>> +	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
>> +		WARN_ONCE(1, "kvm: reading unsupported EVMCS field %lx\n",
>> +			  field);
>> +		return 0;
>> +	}
>> +
>> +	return (u16)vmcs_field_to_evmcs_1[index];
>> +}
>> +
>> +static inline u16 get_evmcs_offset_cf(unsigned long field, u16 *clean_field)
>> +{
>> +	unsigned int index = ROL16(field, 6);
>> +	u32 evmcs_field;
>> +
>> +	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
>> +		WARN_ONCE(1, "kvm: writing unsupported EVMCS field %lx\n",
>> +			  field);
>> +		return 0;
>> +	}
>> +
>> +	evmcs_field = vmcs_field_to_evmcs_1[index];
>> +
>> +	*clean_field = evmcs_field >> 16;
>> +
>> +	return (u16)evmcs_field;
>> +}
>
> You can mark this __always_inline, and make it
>
> 	if (clean_field)
> 		*clean_field = evmcs_field >> 16;
>
> or alternatively, use a two-element struct and do
>
> 	evmcs_field = &vmcs_field_to_evmcs_1[index];
> 	if (clean_field)
> 		*clean_field = evmcs_field->clean_field;
> 	return evmcs_field->offset;
>
> Also, if you return int and make the WARN_ONCE case return -ENOENT, GCC
> should be able to optimize out the "if (!offset)" (which becomes "if
> (offset < 0)") in the callers.  Nitpicking, but...
>

Ok, good suggestion, I'll try.

>> +static void vmcs_load_enlightened(u64 phys_addr)
>> +{
>> +	struct hv_vp_assist_page *vp_ap =
>> +		hv_get_vp_assist_page(smp_processor_id());
>> +
>> +	vp_ap->current_nested_vmcs = phys_addr;
>> +	vp_ap->enlighten_vmentry = 1;
>> +}
>
> evmcs_load?
>

Works for me,

>> +static void evmcs_sanitize_exec_ctrls(u32 *cpu_based_2nd_exec_ctrl,
>> +				      u32 *pin_based_exec_ctrl)
>> +{
>> +	*pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_VMFUNC;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_SHADOW_VMCS;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_TSC_SCALING;
>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>> +	*pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
>
> How can these be set?
>

They can not if Hyper-V behaves but Radim didn't want to trust it -- so
the suggestion was to forcefully disable unsupported controls.

>> @@ -3596,6 +4104,14 @@ static int hardware_enable(void)
>>  	if (cr4_read_shadow() & X86_CR4_VMXE)
>>  		return -EBUSY;
>>  
>> +	/*
>> +	 * This can happen if we hot-added a CPU but failed to allocate
>> +	 * VP assist page for it.
>> +	 */
>> +	if (static_branch_unlikely(&enable_evmcs) &&
>> +	    !hv_get_vp_assist_page(cpu))
>> +		return -EFAULT;
>
> -ENODEV?  Maybe add a printk, because this is really rare.
>

Ok,

>>  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
>>  	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
>>  	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
>> @@ -3829,7 +4345,12 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>  	vmcs_conf->size = vmx_msr_high & 0x1fff;
>>  	vmcs_conf->order = get_order(vmcs_conf->size);
>>  	vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
>> -	vmcs_conf->revision_id = vmx_msr_low;
>> +
>> +	/* KVM supports Enlightened VMCS v1 only */
>> +	if (static_branch_unlikely(&enable_evmcs))
>> +		vmcs_conf->revision_id = KVM_EVMCS_VERSION;
>> +	else
>> +		vmcs_conf->revision_id = vmx_msr_low;
>>  
>>  	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
>>  	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
>> @@ -6990,6 +7511,17 @@ static __init int hardware_setup(void)
>>  		goto out;
>>  	}
>>  
>> +	if (static_branch_unlikely(&enable_evmcs)) {
>> +		evmcs_sanitize_exec_ctrls(&vmcs_config.cpu_based_2nd_exec_ctrl,
>> +					  &vmcs_config.pin_based_exec_ctrl);
>
> Why not do it in setup_vmcs_config after the vmcs_conf->vmentry_ctrl
> assignment (and pass &vmcs_config, which there is "vmcs_conf", directly
> to the function)?  And if sanitizing clears the bits in vmentry_ctl and
> vmexit_ctl, there's no need to clear cpu_has_load_perf_global_ctrl.
>

Ok, if we decide to keep 'sanitization' in place.

>> +		/*
>> +		 * Enlightened VMCSv1 doesn't support these:
>> +		 *	GUEST_IA32_PERF_GLOBAL_CTRL	= 0x00002808,
>> +		 *	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
>> +		 */
>> +		cpu_has_load_perf_global_ctrl = false;
>> +	}
>> +
>>  	if (boot_cpu_has(X86_FEATURE_NX))
>>  		kvm_enable_efer_bits(EFER_NX);
>>  
>> @@ -8745,6 +9277,10 @@ static void dump_vmcs(void)
>>  	if (cpu_has_secondary_exec_ctrls())
>>  		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>>  
>> +	if (static_branch_unlikely(&enable_evmcs))
>> +		evmcs_sanitize_exec_ctrls(&secondary_exec_control,
>> +					  &pin_based_exec_ctrl);
>
> This is wrong, we're reading the VMCS so the values must already be
> sanitized (and if not, that's the bug and we want dump_vmcs to print the
> "wrong" values).

The problem is that we vmcs_read these fields later in the function and
this will now WARN(). Initially, there was no WARN() for non-existent
fields so this could work (we would just print zeroes for unsupported
fields). Maybe, additional WARN_ON() is not a big deal here.

In reality, these controls should never be set.

>
>>  	pr_err("*** Guest State ***\n");
>>  	pr_err("CR0: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
>>  	       vmcs_readl(GUEST_CR0), vmcs_readl(CR0_READ_SHADOW),
>> @@ -8784,7 +9320,8 @@ static void dump_vmcs(void)
>>  	pr_err("DebugCtl = 0x%016llx  DebugExceptions = 0x%016lx\n",
>>  	       vmcs_read64(GUEST_IA32_DEBUGCTL),
>>  	       vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS));
>> -	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
>> +	if (cpu_has_load_perf_global_ctrl &&
>> +	    vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
>>  		pr_err("PerfGlobCtl = 0x%016llx\n",
>>  		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
>>  	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
>> @@ -8820,7 +9357,8 @@ static void dump_vmcs(void)
>>  		pr_err("EFER = 0x%016llx  PAT = 0x%016llx\n",
>>  		       vmcs_read64(HOST_IA32_EFER),
>>  		       vmcs_read64(HOST_IA32_PAT));
>> -	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
>> +	if (cpu_has_load_perf_global_ctrl &&
>> +	    vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
>>  		pr_err("PerfGlobCtl = 0x%016llx\n",
>>  		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
>>  
>> @@ -9397,7 +9935,7 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
>>  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> -	unsigned long cr3, cr4;
>> +	unsigned long cr3, cr4, evmcs_rsp;
>>  
>>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>>  	if (unlikely(!enable_vnmi &&
>> @@ -9463,6 +10001,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>  		native_wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
>>  
>>  	vmx->__launched = vmx->loaded_vmcs->launched;
>> +
>> +	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
>> +		(unsigned long)&current_evmcs->host_rsp : 0;
>
> (If you use text_poke_early, you can do this assignment unconditionally,
> since it's just a single lea instruction).
>

Something to take a look at)

>> @@ -9604,6 +10152,11 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>  	/* Eliminate branch target predictions from guest mode */
>>  	vmexit_fill_RSB();
>>  
>> +	/* All fields are clean at this point */
>> +	if (static_branch_unlikely(&enable_evmcs))
>> +		current_evmcs->hv_clean_fields |=
>> +			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
>> +
>>  	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
>>  	if (vmx->host_debugctlmsr)
>>  		update_debugctlmsr(vmx->host_debugctlmsr);
>> @@ -12419,7 +12972,36 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>>  
>>  static int __init vmx_init(void)
>>  {
>> -	int r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
>> +	int r;
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +	/*
>> +	 * Enlightened VMCS usage should be recommended and the host needs
>> +	 * to support eVMCS v1 or above. We can also disable eVMCS support
>> +	 * with module parameter.
>> +	 */
>> +	if (enlightened_vmcs &&
>> +	    ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
>> +	    (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
>> +	    KVM_EVMCS_VERSION) {
>> +		int cpu;
>> +
>> +		/* Check that we have assist pages on all online CPUs */
>> +		for_each_online_cpu(cpu) {
>> +			if (!hv_get_vp_assist_page(cpu)) {
>> +				enlightened_vmcs = false;
>> +				break;
>> +			}
>> +		}
>> +		if (enlightened_vmcs) {
>> +			pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
>> +			static_branch_enable(&enable_evmcs);
>> +		}
>> +	}
>
> A bit nicer to clear enlightened_vmcs in the "else" branch?

Yes, as a precaution, why not. (But we should solely rely on
'enable_evmcs' later on).

>
> That's it.  Nice work!
>
> Paolo
>
>> +#endif
>> +
>> +	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
>>                       __alignof__(struct vcpu_vmx), THIS_MODULE);
>>  	if (r)
>>  		return r;
>> @@ -12440,6 +13022,29 @@ static void __exit vmx_exit(void)
>>  #endif
>>  
>>  	kvm_exit();
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +	if (static_branch_unlikely(&enable_evmcs)) {
>> +		int cpu;
>> +		struct hv_vp_assist_page *vp_ap;
>> +		/*
>> +		 * Reset everything to support using non-enlightened VMCS
>> +		 * access later (e.g. when we reload the module with
>> +		 * enlightened_vmcs=0)
>> +		 */
>> +		for_each_online_cpu(cpu) {
>> +			vp_ap =	hv_get_vp_assist_page(cpu);
>> +
>> +			if (!vp_ap)
>> +				continue;
>> +
>> +			vp_ap->current_nested_vmcs = 0;
>> +			vp_ap->enlighten_vmentry = 0;
>> +		}
>> +
>> +		static_branch_disable(&enable_evmcs);
>> +	}
>> +#endif
>>  }
>>  
>>  module_init(vmx_init)
>>
Paolo Bonzini March 15, 2018, 11:01 a.m. UTC | #11
On 15/03/2018 10:56, Vitaly Kuznetsov wrote:
>> +	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
>> +	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count,
>> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),

Hmm, actually these six are used.  I guess
HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL is the best we can do, apart from
asking Microsoft to fix the spec.


>>> +{
>>> +	*pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_VMFUNC;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_SHADOW_VMCS;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_TSC_SCALING;
>>> +	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>> +	*pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
>> How can these be set?
>>
> They can not if Hyper-V behaves but Radim didn't want to trust it -- so
> the suggestion was to forcefully disable unsupported controls.

Yeah, it's good to have, especially if placed before we start using the
values that are read.

>> This is wrong, we're reading the VMCS so the values must already be
>> sanitized (and if not, that's the bug and we want dump_vmcs to print the
>> "wrong" values).
> 
> The problem is that we vmcs_read these fields later in the function and
> this will now WARN(). Initally, there was no WARN() for non-existent
> fields so this could work (we would just print zeroes for unsupported
> fields). Maybe, additional WARN_ON() is not a big deal here.

If you WARN(), isn't it because the secondary_exec_control had a bad
value to begin with?  As you say, the controls should never be set.

Thanks,

Paolo
Vitaly Kuznetsov March 15, 2018, 3:19 p.m. UTC | #12
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 09/03/2018 15:02, Vitaly Kuznetsov wrote:
>> Enlightened VMCS is just a structure in memory, the main benefit
>> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
>> mask: we tell the underlying hypervisor which fields were modified
>> since VMEXIT so there's no need to inspect them all.
>> 
>> Tight CPUID loop test shows significant speedup:
>> Before: 18890 cycles
>> After: 8304 cycles
>> 
>> Static key is being used to avoid performance penalty for non-Hyper-V
>> deployments. Tests show we add around 3 (three) CPU cycles on each
>> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
>> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
>> but I don't see a clean way to use static key in assembly.
>
> If you want to live dangerously, you can use text_poke_early to change
> the vmwrite to mov.  It's just a single instruction, so it's probably
> not too hard.

It is not:

+#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_X86_64)
+
+/* Luckily, both original and new instructions are of the same length */
+#define EVMCS_RSP_OPCODE_LEN 3
+static void evmcs_patch_vmx_cpu_run(void)
+{
+       u8 *addr;
+       u8 opcode_old[] = {0x0f, 0x79, 0xd4}; // vmwrite rsp, rdx
+       u8 opcode_new[] = {0x48, 0x89, 0x26}; // mov rsp, (rsi)
+
+       /*
+        * What we're searching for MUST be present in vmx_vcpu_run().
+        * We replace the first occurrence only.
+        */
+       for (addr = (u8 *)vmx_vcpu_run; ; addr++) {
+               if (!memcmp(addr, opcode_old, EVMCS_RSP_OPCODE_LEN)) {
+                       /*
+                        * vmx_vcpu_run is not currently running on other CPUs but
+                        * using text_poke_early() would require us to do manual
+                        * RW remapping of the area.
+                        */
+                       text_poke(addr, opcode_new, EVMCS_RSP_OPCODE_LEN);
+                       break;
+               }
+       }
+}
+#endif
+

text_poke() also needs to be exported.

This works. But hell, this is a crude hack :-) Not sure if there's a
cleaner way to find what needs to be patched without something like jump
label table ...
Radim Krčmář March 15, 2018, 5:02 p.m. UTC | #13
2018-03-15 16:19+0100, Vitaly Kuznetsov:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
> > On 09/03/2018 15:02, Vitaly Kuznetsov wrote:
> >> Enlightened VMCS is just a structure in memory, the main benefit
> >> besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
> >> mask: we tell the underlying hypervisor which fields were modified
> >> since VMEXIT so there's no need to inspect them all.
> >> 
> >> Tight CPUID loop test shows significant speedup:
> >> Before: 18890 cycles
> >> After: 8304 cycles
> >> 
> >> Static key is being used to avoid performance penalty for non-Hyper-V
> >> deployments. Tests show we add around 3 (three) CPU cycles on each
> >> VMEXIT (1077.5 cycles before, 1080.7 cycles after for the same CPUID
> >> loop on bare metal). We can probably avoid one test/jmp in vmx_vcpu_run()
> >> but I don't see a clean way to use static key in assembly.
> >
> > If you want to live dangerously, you can use text_poke_early to change
> > the vmwrite to mov.  It's just a single instruction, so it's probably
> > not too hard.
> 
> It is not:
> 
> +#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_X86_64)
> +
> +/* Luckily, both original and new instructions are of the same length */
> +#define EVMCS_RSP_OPCODE_LEN 3
> +static void evmcs_patch_vmx_cpu_run(void)
> +{
> +       u8 *addr;
> +       u8 opcode_old[] = {0x0f, 0x79, 0xd4}; // vmwrite rsp, rdx
> +       u8 opcode_new[] = {0x48, 0x89, 0x26}; // mov rsp, (rsi)
> +
> +       /*
> +        * What we're searching for MUST be present in vmx_vcpu_run().
> +        * We replace the first occurrence only.
> +        */
> +       for (addr = (u8 *)vmx_vcpu_run; ; addr++) {
> +               if (!memcmp(addr, opcode_old, EVMCS_RSP_OPCODE_LEN)) {
> +                       /*
> +                        * vmx_vcpu_run is not currently running on other CPUs but
> +                        * using text_poke_early() would require us to do manual
> +                        * RW remapping of the area.
> +                        */
> +                       text_poke(addr, opcode_new, EVMCS_RSP_OPCODE_LEN);
> +                       break;
> +               }
> +       }
> +}
> +#endif
> +
> 
> text_poke() also needs to be exported.
> 
> This works. But hell, this is a crude hack :-) Not sure if there's a
> cleaner way to find what needs to be patched without something like jump
> label table ...

Yeah, I can see us accidentally patching parts of other instructions. :)

The target instruction address can be made into a C-accessible symbol
with the same trick that vmx_return uses -- add a .global containing the
address of a label (not sure if a more direct approach would work).
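
Roughly like this, adapting the vmx_return pattern from the vmx_vcpu_run() asm (sketch; the symbol name is hypothetical):

	asm(
		/* ... */
		".Lvmwrite_rsp_rdx: \n\t"
		__ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
		/* ... */
		".pushsection .rodata \n\t"
		".global evmcs_vmwrite_site \n\t"
		"evmcs_vmwrite_site: " _ASM_PTR " .Lvmwrite_rsp_rdx \n\t"
		".popsection"
		/* : outputs, inputs, clobbers as before */
	);

	/* C side: no instruction scan needed */
	extern const unsigned long evmcs_vmwrite_site;

	text_poke((void *)evmcs_vmwrite_site, opcode_new, EVMCS_RSP_OPCODE_LEN);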

The evil in me likes it.  (The good is too lazy to add a decent patching
infrastructure for just one user.)

I would be a bit happier if we didn't assume the desired instruction and
therefore put constraints on a remote code.
We actually already have mov in the assembly:

  "cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
  "je 1f \n\t"
  "mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t" // here
  __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
  "1: \n\t"

Is there a drawback in switching '%c[host_rsp](%0)' to be a general
memory pointer and put either &vmx->host_rsp or &current_evmcs->host_rsp
in there?

We could just overwrite ASM_VMX_VMWRITE_RSP_RDX with a nop then. :)
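
For reference, the v3 hunk under discussion (elided from the truncated diff on this page) makes the asm do roughly the following, with evmcs_rsp passed in as an extra input (the register choice here is an assumption):

	"cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
	"je 1f \n\t"
	"mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
	/* avoid the VMWRITE when Enlightened VMCS is in use */
	"test %%" _ASM_SI ", %%" _ASM_SI " \n\t"
	"jz 2f \n\t"
	"mov %%" _ASM_SP ", (%%" _ASM_SI ") \n\t"
	"jmp 1f \n\t"
	"2: \n\t"
	__ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
	"1: \n\t"

i.e. the extra test/jmp pair the commit message would like to avoid.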

Thanks.
Radim Krčmář March 15, 2018, 5:28 p.m. UTC | #14
2018-03-15 18:02+0100, Radim Krčmář:
> We actually already have mov in the assembly:
> 
>   "cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"

Oh hell, I didn't pay attention to this line before.

>   "je 1f \n\t"
>   "mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t" // here
>   __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"

I bet this path is executed only once in VM's lifetime and what we're
doing is wasting more resources than we're ever going to save ...

>   "1: \n\t"
Vitaly Kuznetsov March 15, 2018, 6:04 p.m. UTC | #15
Radim Krčmář <rkrcmar@redhat.com> writes:

> 2018-03-15 18:02+0100, Radim Krčmář:
>> We actually already have mov in the assembly:
>> 
>>   "cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
>
> Oh hell, I didn't pay attention to this line before.
>

This is still going to work if we conditionally replace it with a pointer
to the evmcs as you suggested before but ...


>>   "je 1f \n\t"
>>   "mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t" // here
>>   __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
>
> I bet this path is executed only once in VM's lifetime and what we're
> doing is wasting more resources than we're ever going to save ...
>

yes, we're not gonna save anything...
Thomas Gleixner March 15, 2018, 7:28 p.m. UTC | #16
On Thu, 15 Mar 2018, Radim Krčmář wrote:
> 2018-03-15 16:19+0100, Vitaly Kuznetsov:
> > This works. But hell, this is a crude hack :-) Not sure if there's a
> > cleaner way to find what needs to be patched without something like jump
> > label table ...
> 
> Yeah, I can see us accidentally patching parts of other instructions. :)
> 
> The target instruction address can be made into a C-accessible symbol
> with the same trick that vmx_return uses -- add a .global containing the
> address of a label (not sure if a more direct approach would work).
> 
> The evil in me likes it.  (The good is too lazy to add a decent patching
> infrastructure for just one user.)

Can we just use jump labels please? There is agreement that 4.17 will have
a dependency on a jump label capable compiler for x86.

Thanks,

	tglx
Radim Krčmář March 15, 2018, 7:43 p.m. UTC | #17
2018-03-15 20:28+0100, Thomas Gleixner:
> On Thu, 15 Mar 2018, Radim Krčmář wrote:
> > 2018-03-15 16:19+0100, Vitaly Kuznetsov:
> > > This works. But hell, this is a crude hack :-) Not sure if there's a
> > > cleaner way to find what needs to be patched without something like jump
> > > label table ...
> > 
> > Yeah, I can see us accidentally patching parts of other instructions. :)
> > 
> > The target instruction address can be made into a C-accessible symbol
> > with the same trick that vmx_return uses -- add a .global containing the
> > address of a label (not sure if a more direct approach would work).
> > 
> > The evil in me likes it.  (The good is too lazy to add a decent patching
> > infrastructure for just one user.)
> 
> Can we just use jump labels please? There is agreement that 4.17 will have
> a dependency on a jump label capable compiler for x86.

Luckily, it turned out that the path is very cold and should use the
simple test-and-jump.

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 051dab74e4e9..44b6efa7d54e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -53,6 +53,7 @@ 
 #include <asm/mmu_context.h>
 #include <asm/microcode.h>
 #include <asm/nospec-branch.h>
+#include <asm/mshyperv.h>
 
 #include "trace.h"
 #include "pmu.h"
@@ -1000,6 +1001,484 @@  static const u32 vmx_msr_index[] = {
 	MSR_EFER, MSR_TSC_AUX, MSR_STAR,
 };
 
+DEFINE_STATIC_KEY_FALSE(enable_evmcs);
+
+#define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs))
+
+#if IS_ENABLED(CONFIG_HYPERV)
+static bool __read_mostly enlightened_vmcs = true;
+module_param(enlightened_vmcs, bool, 0444);
+
+#define KVM_EVMCS_VERSION 1
+
+#define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
+#define EVMCS1_FIELD(number, name, clean_mask)[ROL16(number, 6)] = \
+		(u32)EVMCS1_OFFSET(name) | ((u32)clean_mask << 16)
+
+/*
+ * Lower 16 bits encode offset of the field in struct hv_enlightened_vmcs,
+ * upper 16 bits hold clean field mask.
+ */
+static const u32 vmcs_field_to_evmcs_1[] = {
+	/* 64 bit rw */
+	EVMCS1_FIELD(GUEST_RIP, guest_rip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(GUEST_RSP, guest_rsp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(GUEST_RFLAGS, guest_rflags,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(HOST_IA32_PAT, host_ia32_pat,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_EFER, host_ia32_efer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR0, host_cr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR3, host_cr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR4, host_cr4,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_RIP, host_rip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(IO_BITMAP_A, io_bitmap_a,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
+	EVMCS1_FIELD(IO_BITMAP_B, io_bitmap_b,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
+	EVMCS1_FIELD(MSR_BITMAP, msr_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP),
+	EVMCS1_FIELD(GUEST_ES_BASE, guest_es_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_BASE, guest_cs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_BASE, guest_ss_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_BASE, guest_ds_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_BASE, guest_fs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_BASE, guest_gs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_BASE, guest_ldtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_BASE, guest_tr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GDTR_BASE, guest_gdtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IDTR_BASE, guest_idtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(TSC_OFFSET, tsc_offset,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(VMCS_LINK_POINTER, vmcs_link_pointer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_PAT, guest_ia32_pat,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_EFER, guest_ia32_efer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR0, guest_pdptr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR1, guest_pdptr1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR2, guest_pdptr2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR3, guest_pdptr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR0_READ_SHADOW, cr0_read_shadow,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR4_READ_SHADOW, cr4_read_shadow,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR0, guest_cr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR3, guest_cr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR4, guest_cr4,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_DR7, guest_dr7,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(HOST_FS_BASE, host_fs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_GS_BASE, host_gs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_TR_BASE, host_tr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_GDTR_BASE, host_gdtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IDTR_BASE, host_idtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_RSP, host_rsp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(EPT_POINTER, ept_pointer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT),
+	EVMCS1_FIELD(GUEST_BNDCFGS, guest_bndcfgs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(XSS_EXIT_BITMAP, xss_exit_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+
+	/* 64 bit read only */
+	EVMCS1_FIELD(GUEST_PHYSICAL_ADDRESS, guest_physical_address,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(EXIT_QUALIFICATION, exit_qualification,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	/*
+	 * Not defined in KVM:
+	 *
+	 * EVMCS1_FIELD(0x00006402, exit_io_instruction_ecx,
+	 *		HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006404, exit_io_instruction_esi,
+	 *		HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006406, exit_io_instruction_edi,
+	 *		HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006408, exit_io_instruction_eip,
+	 *		HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 */
+	EVMCS1_FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+
+	/*
+	 * No mask defined in the spec as Hyper-V doesn't currently support
+	 * these. Future-proof by resetting the whole clean field mask on
+	 * access.
+	 */
+	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_VALUE0, cr3_target_value0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_VALUE1, cr3_target_value1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_VALUE2, cr3_target_value2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_VALUE3, cr3_target_value3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+
+	/* 32 bit rw */
+	EVMCS1_FIELD(TPR_THRESHOLD, tpr_threshold,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC),
+	EVMCS1_FIELD(EXCEPTION_BITMAP, exception_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN),
+	EVMCS1_FIELD(VM_ENTRY_CONTROLS, vm_entry_controls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY),
+	EVMCS1_FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     vm_entry_exception_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(VM_EXIT_CONTROLS, vm_exit_controls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(GUEST_ES_LIMIT, guest_es_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_LIMIT, guest_cs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_LIMIT, guest_ss_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_LIMIT, guest_ds_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_LIMIT, guest_fs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_LIMIT, guest_gs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_LIMIT, guest_tr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_ACTIVITY_STATE, guest_activity_state,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+
+	/* 32 bit read only */
+	EVMCS1_FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_REASON, vm_exit_reason,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+
+	/* No mask defined in the spec (not used) */
+	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_COUNT, cr3_target_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+
+	/* 16 bit rw */
+	EVMCS1_FIELD(HOST_ES_SELECTOR, host_es_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CS_SELECTOR, host_cs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_SS_SELECTOR, host_ss_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_DS_SELECTOR, host_ds_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_FS_SELECTOR, host_fs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_GS_SELECTOR, host_gs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_TR_SELECTOR, host_tr_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(GUEST_ES_SELECTOR, guest_es_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_SELECTOR, guest_cs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_SELECTOR, guest_ss_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_SELECTOR, guest_ds_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_SELECTOR, guest_fs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_SELECTOR, guest_gs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_SELECTOR, guest_tr_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT),
+};
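
Each initializer above packs two values into one u32 array entry: the
offset of the matching struct hv_enlightened_vmcs member in the low 16
bits and the clean field number in the high 16 bits, with the entry
indexed by the rotated VMCS field encoding. The EVMCS1_FIELD macro
itself is defined earlier in this patch; a sketch of the assumed
packing, for illustration only:

	/* Illustration only -- not the authoritative macro definition */
	#define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
	#define EVMCS1_FIELD(number, name, clean_field)		\
		[ROL16(number, 6)] = EVMCS1_OFFSET(name) |	\
				     ((clean_field) << 16)
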
+
+static inline u16 get_evmcs_offset(unsigned long field)
+{
+	unsigned int index = ROL16(field, 6);
+
+	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
+		WARN_ONCE(1, "kvm: reading unsupported EVMCS field %lx\n",
+			  field);
+		return 0;
+	}
+
+	return (u16)vmcs_field_to_evmcs_1[index];
+}
+
+static inline u16 get_evmcs_offset_cf(unsigned long field, u16 *clean_field)
+{
+	unsigned int index = ROL16(field, 6);
+	u32 evmcs_field;
+
+	if (index >= ARRAY_SIZE(vmcs_field_to_evmcs_1)) {
+		WARN_ONCE(1, "kvm: writing unsupported EVMCS field %lx\n",
+			  field);
+		return 0;
+	}
+
+	evmcs_field = vmcs_field_to_evmcs_1[index];
+
+	*clean_field = evmcs_field >> 16;
+
+	return (u16)evmcs_field;
+}
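
A worked lookup, for illustration: GUEST_CR3 is encoded as 0x6802, and
ROL16(0x6802, 6) rotates the high encoding bits into the low bits,
giving index 0x9a. Under the packing sketched above:

	unsigned int index = ROL16(GUEST_CR3, 6);	/* 0x9a */
	u32 entry = vmcs_field_to_evmcs_1[index];
	u16 offset = (u16)entry;	/* offsetof(..., guest_cr3) */
	u16 clean = entry >> 16; /* HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR */
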
+
+static inline void evmcs_write64(unsigned long field, u64 value)
+{
+	u16 clean_field;
+	u16 offset = get_evmcs_offset_cf(field, &clean_field);
+
+	if (!offset)
+		return;
+
+	*(u64 *)((char *)current_evmcs + offset) = value;
+
+	current_evmcs->hv_clean_fields &= ~clean_field;
+}
+
+static inline void evmcs_write32(unsigned long field, u32 value)
+{
+	u16 clean_field;
+	u16 offset = get_evmcs_offset_cf(field, &clean_field);
+
+	if (!offset)
+		return;
+
+	*(u32 *)((char *)current_evmcs + offset) = value;
+	current_evmcs->hv_clean_fields &= ~clean_field;
+}
+
+static inline void evmcs_write16(unsigned long field, u16 value)
+{
+	u16 clean_field;
+	u16 offset = get_evmcs_offset_cf(field, &clean_field);
+
+	if (!offset)
+		return;
+
+	*(u16 *)((char *)current_evmcs + offset) = value;
+	current_evmcs->hv_clean_fields &= ~clean_field;
+}
+
+static inline u64 evmcs_read64(unsigned long field)
+{
+	u16 offset = get_evmcs_offset(field);
+
+	if (!offset)
+		return 0;
+
+	return *(u64 *)((char *)current_evmcs + offset);
+}
+
+static inline u32 evmcs_read32(unsigned long field)
+{
+	u16 offset = get_evmcs_offset(field);
+
+	if (!offset)
+		return 0;
+
+	return *(u32 *)((char *)current_evmcs + offset);
+}
+
+static inline u16 evmcs_read16(unsigned long field)
+{
+	u16 offset = get_evmcs_offset(field);
+
+	if (!offset)
+		return 0;
+
+	return *(u16 *)((char *)current_evmcs + offset);
+}
+
+static void vmcs_load_enlightened(u64 phys_addr)
+{
+	struct hv_vp_assist_page *vp_ap =
+		hv_get_vp_assist_page(smp_processor_id());
+
+	vp_ap->current_nested_vmcs = phys_addr;
+	vp_ap->enlighten_vmentry = 1;
+}
+
+static void evmcs_sanitize_exec_ctrls(u32 *cpu_based_2nd_exec_ctrl,
+				      u32 *pin_based_exec_ctrl)
+{
+	/*
+	 * Enlightened VMCSv1 doesn't support these:
+	 *
+	 *	POSTED_INTR_NV                  = 0x00000002,
+	 *	GUEST_INTR_STATUS               = 0x00000810,
+	 *	APIC_ACCESS_ADDR		= 0x00002014,
+	 *	POSTED_INTR_DESC_ADDR           = 0x00002016,
+	 *	EOI_EXIT_BITMAP0                = 0x0000201c,
+	 *	EOI_EXIT_BITMAP1                = 0x0000201e,
+	 *	EOI_EXIT_BITMAP2                = 0x00002020,
+	 *	EOI_EXIT_BITMAP3                = 0x00002022,
+	 */
+	*pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
+
+	/*
+	 *	GUEST_PML_INDEX			= 0x00000812,
+	 *	PML_ADDRESS			= 0x0000200e,
+	 */
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_PML;
+
+	/*	VM_FUNCTION_CONTROL             = 0x00002018, */
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_ENABLE_VMFUNC;
+
+	/*
+	 *	EPTP_LIST_ADDRESS               = 0x00002024,
+	 *	VMREAD_BITMAP                   = 0x00002026,
+	 *	VMWRITE_BITMAP                  = 0x00002028,
+	 */
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_SHADOW_VMCS;
+
+	/*
+	 *	TSC_MULTIPLIER                  = 0x00002032,
+	 */
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_TSC_SCALING;
+
+	/*
+	 *	PLE_GAP                         = 0x00004020,
+	 *	PLE_WINDOW                      = 0x00004022,
+	 */
+	*cpu_based_2nd_exec_ctrl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+
+	/*
+	 *	VMX_PREEMPTION_TIMER_VALUE      = 0x0000482E,
+	 */
+	*pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
+
+	/*
+	 * Currently unsupported in KVM:
+	 *	GUEST_IA32_RTIT_CTL		= 0x00002814,
+	 */
+}
+#else /* !IS_ENABLED(CONFIG_HYPERV) */
+static inline void evmcs_write64(unsigned long field, u64 value) {}
+static inline void evmcs_write32(unsigned long field, u32 value) {}
+static inline void evmcs_write16(unsigned long field, u16 value) {}
+static inline u64 evmcs_read64(unsigned long field) { return 0; }
+static inline u32 evmcs_read32(unsigned long field) { return 0; }
+static inline u16 evmcs_read16(unsigned long field) { return 0; }
+static inline void vmcs_load_enlightened(u64 phys_addr) {}
+static inline void evmcs_sanitize_exec_ctrls(u32 *cpu_based_2nd_exec_ctrl,
+					     u32 *pin_based_exec_ctrl) {}
+#endif /* IS_ENABLED(CONFIG_HYPERV) */
+
 static inline bool is_exception_n(u32 intr_info, u8 vector)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
@@ -1473,6 +1952,9 @@  static void vmcs_load(struct vmcs *vmcs)
 	u64 phys_addr = __pa(vmcs);
 	u8 error;
 
+	if (static_branch_unlikely(&enable_evmcs))
+		return vmcs_load_enlightened(phys_addr);
+
 	asm volatile (__ex(ASM_VMX_VMPTRLD_RAX) "; setna %0"
 			: "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
@@ -1646,18 +2128,24 @@  static __always_inline unsigned long __vmcs_readl(unsigned long field)
 static __always_inline u16 vmcs_read16(unsigned long field)
 {
 	vmcs_check16(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_read16(field);
 	return __vmcs_readl(field);
 }
 
 static __always_inline u32 vmcs_read32(unsigned long field)
 {
 	vmcs_check32(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_read32(field);
 	return __vmcs_readl(field);
 }
 
 static __always_inline u64 vmcs_read64(unsigned long field)
 {
 	vmcs_check64(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_read64(field);
 #ifdef CONFIG_X86_64
 	return __vmcs_readl(field);
 #else
@@ -1668,6 +2156,8 @@  static __always_inline u64 vmcs_read64(unsigned long field)
 static __always_inline unsigned long vmcs_readl(unsigned long field)
 {
 	vmcs_checkl(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_read64(field);
 	return __vmcs_readl(field);
 }
 
@@ -1691,18 +2181,27 @@  static __always_inline void __vmcs_writel(unsigned long field, unsigned long val
 static __always_inline void vmcs_write16(unsigned long field, u16 value)
 {
 	vmcs_check16(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write16(field, value);
+
 	__vmcs_writel(field, value);
 }
 
 static __always_inline void vmcs_write32(unsigned long field, u32 value)
 {
 	vmcs_check32(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write32(field, value);
+
 	__vmcs_writel(field, value);
 }
 
 static __always_inline void vmcs_write64(unsigned long field, u64 value)
 {
 	vmcs_check64(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write64(field, value);
+
 	__vmcs_writel(field, value);
 #ifndef CONFIG_X86_64
 	asm volatile ("");
@@ -1713,6 +2212,9 @@  static __always_inline void vmcs_write64(unsigned long field, u64 value)
 static __always_inline void vmcs_writel(unsigned long field, unsigned long value)
 {
 	vmcs_checkl(field);
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write64(field, value);
+
 	__vmcs_writel(field, value);
 }
 
@@ -1720,6 +2222,9 @@  static __always_inline void vmcs_clear_bits(unsigned long field, u32 mask)
 {
         BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x2000,
 			 "vmcs_clear_bits does not support 64-bit fields");
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write32(field, evmcs_read32(field) & ~mask);
+
 	__vmcs_writel(field, __vmcs_readl(field) & ~mask);
 }
 
@@ -1727,6 +2232,9 @@  static __always_inline void vmcs_set_bits(unsigned long field, u32 mask)
 {
         BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x2000,
 			 "vmcs_set_bits does not support 64-bit fields");
+	if (static_branch_unlikely(&enable_evmcs))
+		return evmcs_write32(field, evmcs_read32(field) | mask);
+
 	__vmcs_writel(field, __vmcs_readl(field) | mask);
 }
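
All of these wrappers hinge on the enable_evmcs static key, declared
earlier in the patch; a sketch of the assumed declaration and of how
vmx_init() later flips it:

	/* Sketch -- the real declaration lives earlier in this patch: */
	static DEFINE_STATIC_KEY_FALSE(enable_evmcs);

	/* vmx_init() enables it at runtime when eVMCS is usable: */
	static_branch_enable(&enable_evmcs);

With jump labels available, the disabled case is a patched NOP, so
non-Hyper-V hosts pay next to nothing for the extra check.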
 
@@ -3596,6 +4104,14 @@  static int hardware_enable(void)
 	if (cr4_read_shadow() & X86_CR4_VMXE)
 		return -EBUSY;
 
+	/*
+	 * This can happen if we hot-added a CPU but failed to allocate
+	 * the VP assist page for it.
+	 */
+	if (static_branch_unlikely(&enable_evmcs) &&
+	    !hv_get_vp_assist_page(cpu))
+		return -EFAULT;
+
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
 	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
 	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
@@ -3829,7 +4345,12 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	vmcs_conf->size = vmx_msr_high & 0x1fff;
 	vmcs_conf->order = get_order(vmcs_conf->size);
 	vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
-	vmcs_conf->revision_id = vmx_msr_low;
+
+	/* KVM supports Enlightened VMCS v1 only */
+	if (static_branch_unlikely(&enable_evmcs))
+		vmcs_conf->revision_id = KVM_EVMCS_VERSION;
+	else
+		vmcs_conf->revision_id = vmx_msr_low;
 
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
@@ -6990,6 +7511,17 @@  static __init int hardware_setup(void)
 		goto out;
 	}
 
+	if (static_branch_unlikely(&enable_evmcs)) {
+		evmcs_sanitize_exec_ctrls(&vmcs_config.cpu_based_2nd_exec_ctrl,
+					  &vmcs_config.pin_based_exec_ctrl);
+		/*
+		 * Enlightened VMCSv1 doesn't support these:
+		 *	GUEST_IA32_PERF_GLOBAL_CTRL	= 0x00002808,
+		 *	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
+		 */
+		cpu_has_load_perf_global_ctrl = false;
+	}
+
 	if (boot_cpu_has(X86_FEATURE_NX))
 		kvm_enable_efer_bits(EFER_NX);
 
@@ -8745,6 +9277,10 @@  static void dump_vmcs(void)
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
 
+	if (static_branch_unlikely(&enable_evmcs))
+		evmcs_sanitize_exec_ctrls(&secondary_exec_control,
+					  &pin_based_exec_ctrl);
+
 	pr_err("*** Guest State ***\n");
 	pr_err("CR0: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
 	       vmcs_readl(GUEST_CR0), vmcs_readl(CR0_READ_SHADOW),
@@ -8784,7 +9320,8 @@  static void dump_vmcs(void)
 	pr_err("DebugCtl = 0x%016llx  DebugExceptions = 0x%016lx\n",
 	       vmcs_read64(GUEST_IA32_DEBUGCTL),
 	       vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS));
-	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
+	if (cpu_has_load_perf_global_ctrl &&
+	    vmentry_ctl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
 		pr_err("PerfGlobCtl = 0x%016llx\n",
 		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
 	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
@@ -8820,7 +9357,8 @@  static void dump_vmcs(void)
 		pr_err("EFER = 0x%016llx  PAT = 0x%016llx\n",
 		       vmcs_read64(HOST_IA32_EFER),
 		       vmcs_read64(HOST_IA32_PAT));
-	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
+	if (cpu_has_load_perf_global_ctrl &&
+	    vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
 		pr_err("PerfGlobCtl = 0x%016llx\n",
 		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
 
@@ -9397,7 +9935,7 @@  static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned long cr3, cr4;
+	unsigned long cr3, cr4, evmcs_rsp;
 
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!enable_vnmi &&
@@ -9463,6 +10001,10 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		native_wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
 
 	vmx->__launched = vmx->loaded_vmcs->launched;
+
+	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
+		(unsigned long)&current_evmcs->host_rsp : 0;
+
 	asm(
 		/* Store host registers */
 		"push %%" _ASM_DX "; push %%" _ASM_BP ";"
@@ -9471,15 +10013,21 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		"cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
 		"je 1f \n\t"
 		"mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t"
+		/* Avoid VMWRITE when Enlightened VMCS is in use */
+		"test %%" _ASM_SI ", %%" _ASM_SI " \n\t"
+		"jz 2f \n\t"
+		"mov %%" _ASM_SP ", (%%" _ASM_SI ") \n\t"
+		"jmp 1f \n\t"
+		"2: \n\t"
 		__ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
 		"1: \n\t"
 		/* Reload cr2 if changed */
 		"mov %c[cr2](%0), %%" _ASM_AX " \n\t"
 		"mov %%cr2, %%" _ASM_DX " \n\t"
 		"cmp %%" _ASM_AX ", %%" _ASM_DX " \n\t"
-		"je 2f \n\t"
+		"je 3f \n\t"
 		"mov %%" _ASM_AX", %%cr2 \n\t"
-		"2: \n\t"
+		"3: \n\t"
 		/* Check if vmlaunch or vmresume is needed */
 		"cmpl $0, %c[launched](%0) \n\t"
 		/* Load guest registers.  Don't clobber flags. */
@@ -9548,7 +10096,7 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		".global vmx_return \n\t"
 		"vmx_return: " _ASM_PTR " 2b \n\t"
 		".popsection"
-	      : : "c"(vmx), "d"((unsigned long)HOST_RSP),
+	      : : "c"(vmx), "d"((unsigned long)HOST_RSP), "S"(evmcs_rsp),
 		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
 		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
 		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),
@@ -9573,10 +10121,10 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		[wordsize]"i"(sizeof(ulong))
 	      : "cc", "memory"
 #ifdef CONFIG_X86_64
-		, "rax", "rbx", "rdi", "rsi"
+		, "rax", "rbx", "rdi"
 		, "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
 #else
-		, "eax", "ebx", "edi", "esi"
+		, "eax", "ebx", "edi"
 #endif
 	      );
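
The %rsi plumbing added above reduces to the following C logic (a
sketch only -- current_rsp stands in for the live stack pointer that
the real asm compares):

	if (current_rsp != vmx->host_rsp) {
		vmx->host_rsp = current_rsp;
		if (evmcs_rsp)
			/* eVMCS: plain store into the in-memory VMCS */
			current_evmcs->host_rsp = current_rsp;
		else
			/* legacy path: VMWRITE to HOST_RSP */
			__vmcs_writel(HOST_RSP, current_rsp);
	}
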
 
@@ -9604,6 +10152,11 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
+	/* All fields are clean at this point */
+	if (static_branch_unlikely(&enable_evmcs))
+		current_evmcs->hv_clean_fields |=
+			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
+
 	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
 	if (vmx->host_debugctlmsr)
 		update_debugctlmsr(vmx->host_debugctlmsr);
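
Together with the dirtying done in the evmcs_write*() helpers, this
completes the clean-fields protocol: everything is marked clean right
after a run, and only the groups KVM touches afterwards are re-read by
Hyper-V. For illustration (new_cr3 is a made-up value):

	/* 1) after vmx_vcpu_run(), every group is marked clean: */
	current_evmcs->hv_clean_fields |=
		HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;

	/* 2) a later write dirties just its own group: */
	evmcs_write64(GUEST_CR3, new_cr3);
	/* clears HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR, so the next
	 * enlightened VMENTRY reloads only the CR/DR group */
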
@@ -12419,7 +12972,36 @@  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 
 static int __init vmx_init(void)
 {
-	int r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
+	int r;
+
+#if IS_ENABLED(CONFIG_HYPERV)
+	/*
+	 * Enlightened VMCS usage should be recommended by the hypervisor
+	 * and the host needs to support eVMCS v1 or above. eVMCS support
+	 * can also be disabled with the 'enlightened_vmcs' module parameter.
+	 */
+	if (enlightened_vmcs &&
+	    ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
+	    (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
+	    KVM_EVMCS_VERSION) {
+		int cpu;
+
+		/* Check that we have assist pages on all online CPUs */
+		for_each_online_cpu(cpu) {
+			if (!hv_get_vp_assist_page(cpu)) {
+				enlightened_vmcs = false;
+				break;
+			}
+		}
+
+		if (enlightened_vmcs) {
+			pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
+			static_branch_enable(&enable_evmcs);
+		}
+	}
+#endif
+
+	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
                      __alignof__(struct vcpu_vmx), THIS_MODULE);
 	if (r)
 		return r;
@@ -12440,6 +13022,29 @@  static void __exit vmx_exit(void)
 #endif
 
 	kvm_exit();
+
+#if IS_ENABLED(CONFIG_HYPERV)
+	if (static_branch_unlikely(&enable_evmcs)) {
+		int cpu;
+		struct hv_vp_assist_page *vp_ap;
+		/*
+		 * Reset everything to support using non-enlightened VMCS
+		 * access later (e.g. when we reload the module with
+	 * enlightened_vmcs=0).
+		 */
+		for_each_online_cpu(cpu) {
+			vp_ap =	hv_get_vp_assist_page(cpu);
+
+			if (!vp_ap)
+				continue;
+
+			vp_ap->current_nested_vmcs = 0;
+			vp_ap->enlighten_vmentry = 0;
+		}
+
+		static_branch_disable(&enable_evmcs);
+	}
+#endif
 }
 
 module_init(vmx_init)