[v2,17/18] KVM: nVMX: add option to perform early consistency checks via H/W
diff mbox series

Message ID 20180828160459.14093-18-sean.j.christopherson@intel.com
State New
Headers show
Series
  • KVM: nVMX: add option to perform early consistency checks via H/W
Related show

Commit Message

Sean Christopherson Aug. 28, 2018, 4:04 p.m. UTC
KVM defers many VMX consistency checks to the CPU, ostensibly for
performance reasons[1], including checks that result in VMFail (as
opposed to VMExit).  This behavior may be undesirable for some users
since this means KVM detects certain classes of VMFail only after it
has processed guest state, e.g. emulated MSR load-on-entry.  Because
there is a strict ordering between checks that cause VMFail and those
that cause VMExit, i.e. all VMFail checks are performed before any
checks that cause VMExit, we can detect all VMFail conditions via a
dry run of sorts.

After preparing vmcs02 with all state needed to pass the VMFail
consistency checks, optionally do a "test" VMEnter with an invalid
GUEST_RIP.  If the VMEnter results in a VMExit (due to bad guest
state), then we can safely say that the nested VMEnter should not
VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must
be due to an L0 bug.

Unfortunately, since the "passing" case causes a VMExit, KVM must
be extra diligent to ensure that host state is restored, e.g. DR7
and RFLAGS are reset on VMExit.  Failure to restore RFLAGS.IF is
particularly fatal.

And of course the extra VMEnter and VMExit impacts performance.
The raw overhead of the early consistency checks is ~6% on modern
hardware (though this could easily vary based on configuration),
while the added latency observed from the L1 VMM is ~10%.  The
early consistency checks do not occur in a vacuum, e.g. spending
more time in L0 can lead to more interrupts being serviced while
emulating VMEnter, thereby increasing the latency observed by L1.

Add a module param, early_consistency_checks, to provide control
over whether or not VMX performs the early consistency checks.
In addition to standard on/off behavior, the param accepts a value
of -1, which is essentialy an "auto" setting whereby KVM does
the early checks only when it thinks it's running on bare metal.
When running nested, doing early checks is of dubious value since
the resulting behavior is heavily dependent on L0.  In the future,
the "auto" setting could also be used to default to skipping the
early hardware checks for certain configurations/platforms if KVM
reaches a state where it has 100% coverage of VMFail conditions.

[1] To my knowledge no one has implemented and tested full software
    emulation of the VMFail consistency checks.  Until that happens,
    one can only speculate about the actual performance overhead of
    doing all VMFail consistency checks in software.  Obviously any
    code is slower than no code, but in the grand scheme of nested
    virtualization it's entirely possible the overhead is negligible.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx.c | 151 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 146 insertions(+), 5 deletions(-)

Comments

kbuild test robot Aug. 30, 2018, 5:48 p.m. UTC | #1
Hi Sean,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/linux-next]
[also build test WARNING on v4.19-rc1 next-20180830]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Sean-Christopherson/KVM-nVMX-add-option-to-perform-early-consistency-checks-via-H-W/20180830-231534
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: i386-randconfig-x076-201834 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   arch/x86/kvm/vmx.c: In function 'nested_vmx_check_vmentry_hw':
>> arch/x86/kvm/vmx.c:12595:25: warning: large integer implicitly truncated to unsigned type [-Woverflow]
     vmcs_writel(GUEST_RIP, 0xf0f0ULL << 48);
                            ^~~~~~~~~

vim +12595 arch/x86/kvm/vmx.c

 12573	
 12574	static int __noclone nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
 12575	{
 12576		struct vcpu_vmx *vmx = to_vmx(vcpu);
 12577		unsigned long cr3, cr4;
 12578	
 12579		if (!early_consistency_checks)
 12580			return 0;
 12581	
 12582		if (vmx->msr_autoload.host.nr)
 12583			vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
 12584		if (vmx->msr_autoload.guest.nr)
 12585			vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
 12586	
 12587		preempt_disable();
 12588	
 12589		vmx_prepare_switch_to_guest(vcpu);
 12590	
 12591		/*
 12592		 * prepare_vmcs02() writes GUEST_RIP unconditionally, no need
 12593		 * to save/restore or set dirty bits.
 12594		 */
 12595		vmcs_writel(GUEST_RIP, 0xf0f0ULL << 48);
 12596	
 12597		vmcs_writel(HOST_RIP, vmx_early_consistency_check_return);
 12598	
 12599		cr3 = __get_current_cr3_fast();
 12600		if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
 12601			vmcs_writel(HOST_CR3, cr3);
 12602			vmx->loaded_vmcs->host_state.cr3 = cr3;
 12603		}
 12604	
 12605		cr4 = cr4_read_shadow();
 12606		if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
 12607			vmcs_writel(HOST_CR4, cr4);
 12608			vmx->loaded_vmcs->host_state.cr4 = cr4;
 12609		}
 12610	
 12611		vmx->__launched = vmx->loaded_vmcs->launched;
 12612	
 12613		asm(
 12614			/* Set HOST_RSP */
 12615			__ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
 12616			"mov %%" _ASM_SP ", %c[host_rsp](%0)\n\t"
 12617	
 12618			/* Check if vmlaunch of vmresume is needed */
 12619			"cmpl $0, %c[launched](%0)\n\t"
 12620			"je 1f\n\t"
 12621			__ex(ASM_VMX_VMRESUME) "\n\t"
 12622			"jmp 2f\n\t"
 12623			"1: " __ex(ASM_VMX_VMLAUNCH) "\n\t"
 12624			"jmp 2f\n\t"
 12625			"2: "
 12626	
 12627			/* Set vmx->fail accordingly */
 12628			"setbe %c[fail](%0)\n\t"
 12629	
 12630			".pushsection .rodata\n\t"
 12631			".global vmx_early_consistency_check_return\n\t"
 12632			"vmx_early_consistency_check_return: " _ASM_PTR " 2b\n\t"
 12633			".popsection"
 12634		      :
 12635		      : "c"(vmx), "d"((unsigned long)HOST_RSP),
 12636			[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
 12637			[fail]"i"(offsetof(struct vcpu_vmx, fail)),
 12638			[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp))
 12639		      : "rax", "cc", "memory"
 12640		);
 12641	
 12642		vmcs_writel(HOST_RIP, vmx_return);
 12643	
 12644		preempt_enable();
 12645	
 12646		if (vmx->msr_autoload.host.nr)
 12647			vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
 12648		if (vmx->msr_autoload.guest.nr)
 12649			vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
 12650	
 12651		if (vmx->fail) {
 12652			WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) !=
 12653				     VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 12654			vmx->fail = 0;
 12655			return 1;
 12656		}
 12657	
 12658		/*
 12659		 * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
 12660		 */
 12661		local_irq_enable();
 12662		if (hw_breakpoint_active())
 12663			set_debugreg(__this_cpu_read(cpu_dr7), 7);
 12664	
 12665		/*
 12666		 * A non-failing VMEntry means we somehow entered guest mode with
 12667		 * an illegal RIP, and that's just the tip of the iceberg.  There
 12668		 * is no telling what memory has been modified or what state has
 12669		 * been exposed to unknown code.  Hitting this all but guarantees
 12670		 * a (very critical) hardware issue.
 12671		 */
 12672		WARN_ON(!(vmcs_read32(VM_EXIT_REASON) &
 12673			VMX_EXIT_REASONS_FAILED_VMENTRY));
 12674	
 12675		return 0;
 12676	}
 12677	STACK_FRAME_NON_STANDARD(nested_vmx_check_vmentry_hw);
 12678	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Patch
diff mbox series

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3d826550cc9d..311cd32ab584 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -54,6 +54,7 @@ 
 #include <asm/mmu_context.h>
 #include <asm/spec-ctrl.h>
 #include <asm/mshyperv.h>
+#include <asm/hypervisor.h>
 
 #include "trace.h"
 #include "pmu.h"
@@ -110,6 +111,15 @@  module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO);
 static bool __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);
 
+/*
+ * early_consistency_checks is a tri-state:
+ *	-1 == "auto"
+ *	 0 == "never"
+ *	 1 == "always".
+ */
+static int __read_mostly early_consistency_checks = -1;
+module_param(early_consistency_checks, int, 0444);
+
 static u64 __read_mostly host_xss;
 
 static bool __read_mostly enable_pml = 1;
@@ -188,6 +198,7 @@  static unsigned int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 module_param(ple_window_max, uint, 0444);
 
 extern const ulong vmx_return;
+extern const ulong vmx_early_consistency_check_return;
 
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
@@ -7904,6 +7915,10 @@  static __init int hardware_setup(void)
 	if (!cpu_has_virtual_nmis())
 		enable_vnmi = 0;
 
+	if (early_consistency_checks < 0 &&
+	    !hypervisor_is_type(X86_HYPER_NATIVE))
+		early_consistency_checks = 0;
+
 	/*
 	 * set_apic_access_page_addr() is used to reload apic access
 	 * page upon invalidation.  No need to do anything if not
@@ -11883,6 +11898,14 @@  static void prepare_vmcs02_first_run(struct vcpu_vmx *vmx)
 	if (vmx->nested.vmcs02.launched)
 		return;
 
+	/*
+	 * We don't care what the EPTP value is we just need to guarantee
+	 * it's valid so we don't get a false positive when doing early
+	 * consistency checks.
+	 */
+	if (enable_ept && early_consistency_checks)
+		vmcs_write64(EPT_POINTER, construct_eptp(&vmx->vcpu, 0));
+
 	/* All VMFUNCs are currently emulated through L0 vmexits.  */
 	if (cpu_has_vmx_vmfunc())
 		vmcs_write64(VM_FUNCTION_CONTROL, 0);
@@ -11936,7 +11959,9 @@  static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	 * entry, but only if the current (host) sp changed from the value
 	 * we wrote last (vmx->host_rsp).  This cache is no longer relevant
 	 * if we switch vmcs, and rather than hold a separate cache per vmcs,
-	 * here we just force the write to happen on entry.
+	 * here we just force the write to happen on entry.  host_rsp will
+	 * also be written unconditionally by nested_vmx_check_vmentry_hw()
+	 * if we are doing early consistency checks via hardware.
 	 */
 	vmx->host_rsp = 0;
 
@@ -12549,11 +12574,121 @@  static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	return 0;
 }
 
+static int __noclone nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long cr3, cr4;
+
+	if (!early_consistency_checks)
+		return 0;
+
+	if (vmx->msr_autoload.host.nr)
+		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
+	if (vmx->msr_autoload.guest.nr)
+		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
+
+	preempt_disable();
+
+	vmx_prepare_switch_to_guest(vcpu);
+
+	/*
+	 * prepare_vmcs02() writes GUEST_RIP unconditionally, no need
+	 * to save/restore or set dirty bits.
+	 */
+	vmcs_writel(GUEST_RIP, 0xf0f0ULL << 48);
+
+	vmcs_writel(HOST_RIP, vmx_early_consistency_check_return);
+
+	cr3 = __get_current_cr3_fast();
+	if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
+		vmcs_writel(HOST_CR3, cr3);
+		vmx->loaded_vmcs->host_state.cr3 = cr3;
+	}
+
+	cr4 = cr4_read_shadow();
+	if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
+		vmcs_writel(HOST_CR4, cr4);
+		vmx->loaded_vmcs->host_state.cr4 = cr4;
+	}
+
+	vmx->__launched = vmx->loaded_vmcs->launched;
+
+	asm(
+		/* Set HOST_RSP */
+		__ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
+		"mov %%" _ASM_SP ", %c[host_rsp](%0)\n\t"
+
+		/* Check if vmlaunch of vmresume is needed */
+		"cmpl $0, %c[launched](%0)\n\t"
+		"je 1f\n\t"
+		__ex(ASM_VMX_VMRESUME) "\n\t"
+		"jmp 2f\n\t"
+		"1: " __ex(ASM_VMX_VMLAUNCH) "\n\t"
+		"jmp 2f\n\t"
+		"2: "
+
+		/* Set vmx->fail accordingly */
+		"setbe %c[fail](%0)\n\t"
+
+		".pushsection .rodata\n\t"
+		".global vmx_early_consistency_check_return\n\t"
+		"vmx_early_consistency_check_return: " _ASM_PTR " 2b\n\t"
+		".popsection"
+	      :
+	      : "c"(vmx), "d"((unsigned long)HOST_RSP),
+		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
+		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
+		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp))
+	      : "rax", "cc", "memory"
+	);
+
+	vmcs_writel(HOST_RIP, vmx_return);
+
+	preempt_enable();
+
+	if (vmx->msr_autoload.host.nr)
+		vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
+	if (vmx->msr_autoload.guest.nr)
+		vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
+
+	if (vmx->fail) {
+		WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) !=
+			     VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		vmx->fail = 0;
+		return 1;
+	}
+
+	/*
+	 * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
+	 */
+	local_irq_enable();
+	if (hw_breakpoint_active())
+		set_debugreg(__this_cpu_read(cpu_dr7), 7);
+
+	/*
+	 * A non-failing VMEntry means we somehow entered guest mode with
+	 * an illegal RIP, and that's just the tip of the iceberg.  There
+	 * is no telling what memory has been modified or what state has
+	 * been exposed to unknown code.  Hitting this all but guarantees
+	 * a (very critical) hardware issue.
+	 */
+	WARN_ON(!(vmcs_read32(VM_EXIT_REASON) &
+		VMX_EXIT_REASONS_FAILED_VMENTRY));
+
+	return 0;
+}
+STACK_FRAME_NON_STANDARD(nested_vmx_check_vmentry_hw);
+
 static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12);
 /*
  * If exit_qual is NULL, this is being called from state restore (either RSM
  * or KVM_SET_NESTED_STATE).  Otherwise it's called from vmlaunch/vmresume.
+ *
+ * Returns:
+ *   0 - success, i.e. proceed with actual VMEnter
+ *   1 - consistency check VMExit
+ *  -1 - consistency check VMFail
  */
 static int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 					  bool from_vmentry)
@@ -12573,6 +12708,11 @@  static int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	if (from_vmentry) {
 		nested_get_vmcs12_pages(vcpu);
 
+		if (nested_vmx_check_vmentry_hw(vcpu)) {
+			vmx_switch_vmcs(vcpu, &vmx->vmcs01);
+			return -1;
+		}
+
 		if (check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
 			goto consistency_check_vmexit;
 	}
@@ -12700,13 +12840,14 @@  static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	 * We're finally done with prerequisite checking, and can start with
 	 * the nested entry.
 	 */
-
 	vmx->nested.nested_run_pending = 1;
 	ret = nested_vmx_enter_non_root_mode(vcpu, true);
-	if (ret) {
-		vmx->nested.nested_run_pending = 0;
+	vmx->nested.nested_run_pending = !ret;
+	if (ret > 0)
 		return 1;
-	}
+	else if (ret)
+		return nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 
 	/* Hide L1D cache contents from the nested guest.  */
 	vmx->vcpu.arch.l1tf_flush_l1d = true;