
[3/4] KVM/VMX: Invoke NMI non-IST entry instead of IST entry

Message ID: 20210426230949.3561-4-jiangshanlai@gmail.com (mailing list archive)
State: New, archived
Series: x86: Don't invoke asm_exc_nmi() on the kernel stack

Commit Message

Lai Jiangshan April 26, 2021, 11:09 p.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

In VMX, the NMI handler needs to be invoked after an NMI VM-Exit.

Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via
indirect call instead of INTn"), this was done with INTn ("int $2").

But the INTn microcode is relatively expensive, so that commit reworked
NMI VM-Exit handling to invoke the kernel handler via an indirect call.
INTn also doesn't set the NMI-blocked flag required by the Linux kernel
NMI entry, so moving away from INTn is very reasonable.
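
For comparison, before that commit the irqoff path reflected the NMI
roughly like this (a from-memory sketch of the old
vmx_complete_atomic_exit() logic, not a verbatim quote):

	/* We need to handle NMIs before interrupts are enabled */
	if (is_nmi(intr_info)) {
		kvm_before_interrupt(&vmx->vcpu);
		asm("int $2");	/* reflect the NMI via software INTn */
		kvm_after_interrupt(&vmx->vcpu);
	}

The "int $2" delivery goes through the IDT gate and therefore lands on
the NMI IST stack, which is why the old approach never hit the problem
described below.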

Yet some details were missed.  Since that commit, the NMI entry pointer
is fetched from the IDT and called on the kernel stack.  But the NMI
entry installed in the IDT is asm_exc_nmi(), which expects to be invoked
on the IST stack by the hardware, and it relies on the "NMI executing"
variable located on the IST stack to work correctly.  When it is
unexpectedly called on the kernel stack, the RSP-relative "NMI
executing" variable is also taken from the kernel stack, where it is
uninitialized garbage that can send the NMI entry down the wrong path.

So we should not use the NMI entry installed in the IDT.  Rather, we
should use asm_noist_exc_nmi(), the NMI entry that is allowed to run on
the kernel stack and is also used for XENPV and early booting.

Link: https://lore.kernel.org/lkml/20200915191505.10355-3-sean.j.christopherson@intel.com/
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: kvm@vger.kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kernel/nmi.c  | 3 +++
 arch/x86/kvm/vmx/vmx.c | 8 ++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

Comments

Lai Jiangshan April 30, 2021, 2:46 a.m. UTC | #1
On 2021/4/27 07:09, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> In VMX, the NMI handler needs to be invoked after an NMI VM-Exit.
> 
> Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via
> indirect call instead of INTn"), this was done with INTn ("int $2").
> 
> But the INTn microcode is relatively expensive, so that commit reworked
> NMI VM-Exit handling to invoke the kernel handler via an indirect call.
> INTn also doesn't set the NMI-blocked flag required by the Linux kernel
> NMI entry, so moving away from INTn is very reasonable.
> 
> Yet some details were missed.  Since that commit, the NMI entry pointer
> is fetched from the IDT and called on the kernel stack.  But the NMI
> entry installed in the IDT is asm_exc_nmi(), which expects to be invoked
> on the IST stack by the hardware, and it relies on the "NMI executing"
> variable located on the IST stack to work correctly.  When it is
> unexpectedly called on the kernel stack, the RSP-relative "NMI
> executing" variable is also taken from the kernel stack, where it is
> uninitialized garbage that can send the NMI entry down the wrong path.
> 
> So we should not use the NMI entry installed in the IDT.  Rather, we
> should use asm_noist_exc_nmi(), the NMI entry that is allowed to run on
> the kernel stack and is also used for XENPV and early booting.
> 

The problem can be demonstrated with the testing patch below.

1) The testing patch applies without conflict before this patch 3,
    and it shows the problem: the NMI is missed in this case.

2) To verify this patch 3, manually copy the same logic from the
    testing patch on top of it; it then shows that the problem is fixed.

3) The single added line in vmenter.S merely emulates the situation
    where "uninitialized" garbage on the kernel stack happens to be 1 and
    happens to sit at the location of the RSP-relative "NMI executing"
    variable (see the stack-layout sketch after this list).
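
For orientation, here is roughly where that write lands.  The layout
below is our annotation (offsets relative to RSP just before the added
movq), derived from the vmenter.S context in the hunk below and the
frame that asm_exc_nmi() expects:

	# Fake IRET frame built by vmx_do_interrupt_nmi_irqoff (x86-64):
	#
	#   +24  __KERNEL_DS       fake SS
	#   +16  saved RSP         fake RSP
	#    +8  RFLAGS            from "pushf"
	#     0  __KERNEL_CS       fake CS      <- RSP at the added movq
	#    -8  return RIP        pushed by CALL_NOSPEC
	#   -16  %rdx temp slot    pushed by asm_exc_nmi() itself
	#   -24  "NMI executing"   the slot asm_exc_nmi() then tests
	#                          with "cmpl $1, -8(%rsp)"

So writing 1 to -24(%rsp) plants exactly the garbage value that makes
the unpatched entry believe it is handling a nested NMI.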


diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 3a6461694fc2..32096049c2a2 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -316,6 +316,7 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
  #endif
  	pushf
  	push $__KERNEL_CS
+	movq $1, -24(%rsp) // "NMI executing": 1 = nested, non-1 = not-nested
  	CALL_NOSPEC _ASM_ARG1

  	/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bcbf0d2139e9..9509d2edd50d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6416,8 +6416,12 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
  	else if (is_machine_check(intr_info))
  		kvm_machine_check();
  	/* We need to handle NMIs before interrupts are enabled */
-	else if (is_nmi(intr_info))
+	else if (is_nmi(intr_info)) {
+		unsigned long count = this_cpu_read(irq_stat.__nmi_count);
  		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
+		if (count == this_cpu_read(irq_stat.__nmi_count))
+			pr_info("kvm nmi miss\n");
+	}
  }

  static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
Thomas Gleixner May 3, 2021, 7:37 p.m. UTC | #2
On Tue, Apr 27 2021 at 07:09, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> In VMX, the NMI handler needs to be invoked after an NMI VM-Exit.
>
> Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via
> indirect call instead of INTn"), this was done with INTn ("int $2").
>
> But the INTn microcode is relatively expensive, so that commit reworked
> NMI VM-Exit handling to invoke the kernel handler via an indirect call.
> INTn also doesn't set the NMI-blocked flag required by the Linux kernel
> NMI entry, so moving away from INTn is very reasonable.
>
> Yet some details were missed.  Since that commit, the NMI entry pointer
> is fetched from the IDT and called on the kernel stack.  But the NMI
> entry installed in the IDT is asm_exc_nmi(), which expects to be invoked
> on the IST stack by the hardware, and it relies on the "NMI executing"
> variable located on the IST stack to work correctly.  When it is
> unexpectedly called on the kernel stack, the RSP-relative "NMI
> executing" variable is also taken from the kernel stack, where it is
> uninitialized garbage that can send the NMI entry down the wrong path.
>
> So we should not use the NMI entry installed in the IDT.  Rather, we
> should use asm_noist_exc_nmi(), the NMI entry that is allowed to run on
> the kernel stack and is also used for XENPV and early booting.

It's not used by XENPV. XENPV only uses the C entry point, but the ASM
entry is separate.

Thanks,

        tglx
Thomas Gleixner May 3, 2021, 8:02 p.m. UTC | #3
On Tue, Apr 27 2021 at 07:09, Lai Jiangshan wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index bcbf0d2139e9..96e59d912637 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -36,6 +36,7 @@
>  #include <asm/debugreg.h>
>  #include <asm/desc.h>
>  #include <asm/fpu/internal.h>
> +#include <asm/idtentry.h>
>  #include <asm/io.h>
>  #include <asm/irq_remapping.h>
>  #include <asm/kexec.h>
> @@ -6416,8 +6417,11 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
>  	else if (is_machine_check(intr_info))
>  		kvm_machine_check();
>  	/* We need to handle NMIs before interrupts are enabled */
> -	else if (is_nmi(intr_info))
> -		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
> +	else if (is_nmi(intr_info)) {

Lacks curly braces for all of the above conditions according to coding style.

> +		kvm_before_interrupt(&vmx->vcpu);
> +		vmx_do_interrupt_nmi_irqoff((unsigned long)asm_noist_exc_nmi);
> +		kvm_after_interrupt(&vmx->vcpu);
> +	}

but this and the next patch are not really needed. The below avoids the
extra kvm_before/after() dance in both places. Hmm?

Thanks,

        tglx
---
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -526,6 +526,10 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 DEFINE_IDTENTRY_RAW_ALIAS(exc_nmi, exc_nmi_noist);
 
+#if IS_MODULE(CONFIG_KVM_INTEL)
+EXPORT_SYMBOL_GPL(asm_exc_nmi_noist);
+#endif
+
 void stop_nmi(void)
 {
 	ignore_nmis++;
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -36,6 +36,7 @@
 #include <asm/debugreg.h>
 #include <asm/desc.h>
 #include <asm/fpu/internal.h>
+#include <asm/idtentry.h>
 #include <asm/io.h>
 #include <asm/irq_remapping.h>
 #include <asm/kexec.h>
@@ -6395,18 +6396,17 @@ static void vmx_apicv_post_state_restore
 
 void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
+static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+					unsigned long entry)
 {
-	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
 	kvm_before_interrupt(vcpu);
-	vmx_do_interrupt_nmi_irqoff(gate_offset(desc));
+	vmx_do_interrupt_nmi_irqoff(entry);
 	kvm_after_interrupt(vcpu);
 }
 
 static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
 {
+	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
 	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
 
 	/* if exit due to PF check for async PF */
@@ -6417,18 +6417,20 @@ static void handle_exception_nmi_irqoff(
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
 	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
+		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
 }
 
 static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 {
 	u32 intr_info = vmx_get_intr_info(vcpu);
+	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
 	if (WARN_ONCE(!is_external_intr(intr_info),
 	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
-	handle_interrupt_nmi_irqoff(vcpu, intr_info);
+	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
 }
 
 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
Paolo Bonzini May 4, 2021, 8:10 a.m. UTC | #4
On 03/05/21 22:02, Thomas Gleixner wrote:
> but this and the next patch are not really needed. The below avoids the
> extra kvm_before/after() dance in both places. Hmm?

Sure, that's good as well.

Paolo

> Thanks,
> 
>          tglx
> ---
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -526,6 +526,10 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>   
>   DEFINE_IDTENTRY_RAW_ALIAS(exc_nmi, exc_nmi_noist);
>   
> +#if IS_MODULE(CONFIG_KVM_INTEL)
> +EXPORT_SYMBOL_GPL(asm_exc_nmi_noist);
> +#endif
> +
>   void stop_nmi(void)
>   {
>   	ignore_nmis++;
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -36,6 +36,7 @@
>   #include <asm/debugreg.h>
>   #include <asm/desc.h>
>   #include <asm/fpu/internal.h>
> +#include <asm/idtentry.h>
>   #include <asm/io.h>
>   #include <asm/irq_remapping.h>
>   #include <asm/kexec.h>
> @@ -6395,18 +6396,17 @@ static void vmx_apicv_post_state_restore
>   
>   void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
>   
> -static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> +static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
> +					unsigned long entry)
>   {
> -	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> -	gate_desc *desc = (gate_desc *)host_idt_base + vector;
> -
>   	kvm_before_interrupt(vcpu);
> -	vmx_do_interrupt_nmi_irqoff(gate_offset(desc));
> +	vmx_do_interrupt_nmi_irqoff(entry);
>   	kvm_after_interrupt(vcpu);
>   }
>   
>   static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
>   {
> +	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
>   	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
>   
>   	/* if exit due to PF check for async PF */
> @@ -6417,18 +6417,20 @@ static void handle_exception_nmi_irqoff(
>   		kvm_machine_check();
>   	/* We need to handle NMIs before interrupts are enabled */
>   	else if (is_nmi(intr_info))
> -		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
> +		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
>   }
>   
>   static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
>   {
>   	u32 intr_info = vmx_get_intr_info(vcpu);
> +	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> +	gate_desc *desc = (gate_desc *)host_idt_base + vector;
>   
>   	if (WARN_ONCE(!is_external_intr(intr_info),
>   	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
>   		return;
>   
> -	handle_interrupt_nmi_irqoff(vcpu, intr_info);
> +	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
>   }
>   
>   static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)

Patch

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 2fb1fd59d714..919f0400d931 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -528,10 +528,13 @@  DEFINE_IDTENTRY_RAW(noist_exc_nmi)
 {
 	/*
 	 * On Xen PV and early booting stage, NMI doesn't use IST.
+	 * And when it is manually called from VMX NMI VM-Exit handler,
+	 * it doesn't use IST either.
 	 * The C part is the same as native.
 	 */
 	exc_nmi(regs);
 }
+EXPORT_SYMBOL_GPL(asm_noist_exc_nmi);
 
 void stop_nmi(void)
 {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bcbf0d2139e9..96e59d912637 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -36,6 +36,7 @@ 
 #include <asm/debugreg.h>
 #include <asm/desc.h>
 #include <asm/fpu/internal.h>
+#include <asm/idtentry.h>
 #include <asm/io.h>
 #include <asm/irq_remapping.h>
 #include <asm/kexec.h>
@@ -6416,8 +6417,11 @@  static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
 	else if (is_machine_check(intr_info))
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
-	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
+	else if (is_nmi(intr_info)) {
+		kvm_before_interrupt(&vmx->vcpu);
+		vmx_do_interrupt_nmi_irqoff((unsigned long)asm_noist_exc_nmi);
+		kvm_after_interrupt(&vmx->vcpu);
+	}
 }
 
 static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)