KVM: nVMX: Fix loss of pending event before entering L2

Message ID 1535557387-7901-1-git-send-email-liran.alon@oracle.com (mailing list archive)
State New, archived
Series KVM: nVMX: Fix loss of pending event before entering L2

Commit Message

Liran Alon Aug. 29, 2018, 3:43 p.m. UTC
Consider the case where L1 had a pending event that wasn't delivered
before it executed VMLAUNCH/VMRESUME because delivery was disallowed
(e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
L0 needs to evaluate whether this pending event should cause an exit
from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
intercept EXTERNAL_INTERRUPT).

Usually this would be handled by L0 requesting a window (e.g. an IRQ
window) by setting the VMCS accordingly. However, that setting was
done on VMCS01, and now VMCS02 is active instead. Thus, when L1
executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
evaluation by requesting a KVM_REQ_EVENT.

Note that the above scenario occurs when L1 KVM is about to enter L2
but requests an "immediate-exit", since in this case L1 disables
interrupts and then sends a self-IPI before entering L2.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 arch/x86/kvm/vmx.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
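
For readers unfamiliar with the "immediate-exit" trick referenced in the
commit message: a condensed, illustrative sketch of the relevant L1 KVM
sequence in vcpu_enter_guest() (paraphrased for this discussion, not
verbatim kernel code) looks like:

	/* L1 KVM, vcpu_enter_guest(), heavily condensed sketch. */
	local_irq_disable();

	if (req_immediate_exit)
		smp_send_reschedule(vcpu->cpu);	/* self-IPI: an INTR is now
						 * pending in L1 but cannot
						 * be delivered */

	kvm_x86_ops->run(vcpu);			/* VMLAUNCH/VMRESUME into L2;
						 * without this fix, L0 never
						 * re-evaluates the pending
						 * INTR once VMCS02 is loaded */

	local_irq_enable();

The pending self-IPI is exactly the kind of "pending event" the commit
message describes as being lost once VMCS02 replaces VMCS01.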

Comments

Radim Krčmář Aug. 29, 2018, 4:18 p.m. UTC | #1
2018-08-29 18:43+0300, Liran Alon:
> Consider the case where L1 had a pending event that wasn't delivered
> before it executed VMLAUNCH/VMRESUME because delivery was disallowed
> (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> L0 needs to evaluate whether this pending event should cause an exit
> from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
> intercept EXTERNAL_INTERRUPT).
>
> Usually this would be handled by L0 requesting a window (e.g. an IRQ
> window) by setting the VMCS accordingly. However, that setting was
> done on VMCS01, and now VMCS02 is active instead. Thus, when L1
> executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
> evaluation by requesting a KVM_REQ_EVENT.
>
> Note that the above scenario occurs when L1 KVM is about to enter L2
> but requests an "immediate-exit", since in this case L1 disables
> interrupts and then sends a self-IPI before entering L2.

Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".

Do you have a test for this?

Thanks.
Sean Christopherson Aug. 29, 2018, 4:34 p.m. UTC | #2
On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> 2018-08-29 18:43+0300, Liran Alon:
> > Consider the case where L1 had a pending event that wasn't delivered
> > before it executed VMLAUNCH/VMRESUME because delivery was disallowed
> > (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> > L0 needs to evaluate whether this pending event should cause an exit
> > from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
> > intercept EXTERNAL_INTERRUPT).
> >
> > Usually this would be handled by L0 requesting a window (e.g. an IRQ
> > window) by setting the VMCS accordingly. However, that setting was
> > done on VMCS01, and now VMCS02 is active instead. Thus, when L1
> > executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
> > evaluation by requesting a KVM_REQ_EVENT.
> >
> > Note that the above scenario occurs when L1 KVM is about to enter L2
> > but requests an "immediate-exit", since in this case L1 disables
> > interrupts and then sends a self-IPI before entering L2.
> 
> Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".

Please hold off on doing anything with this, I don't think this is the
correct fix.  I have a half-finished response to the preemption timer
thread that prompted this patch, I'll get that sent ASAP. 

> Do you have a test for this?

Run the preemption timer KVM unit test in L1 or L2.

> Thanks.
Radim Krčmář Aug. 29, 2018, 4:39 p.m. UTC | #3
2018-08-29 09:34-0700, Sean Christopherson:
> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> > 2018-08-29 18:43+0300, Liran Alon:
> > > Consider the case where L1 had a pending event that wasn't delivered
> > > before it executed VMLAUNCH/VMRESUME because delivery was disallowed
> > > (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> > > L0 needs to evaluate whether this pending event should cause an exit
> > > from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
> > > intercept EXTERNAL_INTERRUPT).
> > >
> > > Usually this would be handled by L0 requesting a window (e.g. an IRQ
> > > window) by setting the VMCS accordingly. However, that setting was
> > > done on VMCS01, and now VMCS02 is active instead. Thus, when L1
> > > executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
> > > evaluation by requesting a KVM_REQ_EVENT.
> > >
> > > Note that the above scenario occurs when L1 KVM is about to enter L2
> > > but requests an "immediate-exit", since in this case L1 disables
> > > interrupts and then sends a self-IPI before entering L2.
> > 
> > Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".
> 
> Please hold off on doing anything with this, I don't think this is the
> correct fix.  I have a half-finished response to the preemption timer
> thread that prompted this patch, I'll get that sent ASAP. 

Sure, thanks for the heads-up.
Liran Alon Aug. 29, 2018, 5:12 p.m. UTC | #4
> On 29 Aug 2018, at 19:39, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 
> 2018-08-29 09:34-0700, Sean Christopherson:
>> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
>>> 2018-08-29 18:43+0300, Liran Alon:
>>>> Consider the case where L1 had a pending event that wasn't delivered
>>>> before it executed VMLAUNCH/VMRESUME because delivery was disallowed
>>>> (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
>>>> L0 needs to evaluate whether this pending event should cause an exit
>>>> from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
>>>> intercept EXTERNAL_INTERRUPT).
>>>>
>>>> Usually this would be handled by L0 requesting a window (e.g. an IRQ
>>>> window) by setting the VMCS accordingly. However, that setting was
>>>> done on VMCS01, and now VMCS02 is active instead. Thus, when L1
>>>> executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
>>>> evaluation by requesting a KVM_REQ_EVENT.
>>>>
>>>> Note that the above scenario occurs when L1 KVM is about to enter L2
>>>> but requests an "immediate-exit", since in this case L1 disables
>>>> interrupts and then sends a self-IPI before entering L2.
>>> 
>>> Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".
>> 
>> Please hold off on doing anything with this, I don't think this is the
>> correct fix.  I have a half-finished response to the preemption timer
>> thread that prompted this patch, I'll get that sent ASAP. 
> 
> Sure, thanks for the heads-up.

Sean, I think this is orthogonal to the “immediate-exit” mechanism implementation issue,
which you suggest replacing with a preemption-timer with an interval of 0 instead of a self-IPI.

In my opinion, this patch handles a general issue of losing a pending interrupt that was queued
(and disallowed from being dispatched) in L1 before entering L2. This is not just related
to the immediate-exit mechanism. It is also true, for example, for a timer interrupt that may be
raised in L1 during the timespan in which L1 disables interrupts until it VMRESUMEs into L2.

I have actually written a small but effective kvm-unit-test for this. It fails before this patch and passes after it.
I will submit the unit-test and Cc you guys.

-Liran
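
[A rough sketch, for readers following along, of what such a kvm-unit-test
could look like. The helpers used here (handle_irq, test_set_guest,
enter_guest, vmcs_read/vmcs_write, apic_icr_write, assert_exit_reason,
report) follow the kvm-unit-tests x86/vmx framework, but treat the details
as illustrative rather than as the actual submitted test.]

	static volatile bool ipi_fired;

	static void ipi_isr(isr_regs_t *regs)
	{
		ipi_fired = true;
		eoi();
	}

	static void pending_event_guest(void)
	{
		/* L2 does nothing; it must not run before the pending L1
		 * interrupt has been handled. */
	}

	static void vmx_pending_event_test(void)
	{
		const unsigned ipi_vec = 0xf1;

		handle_irq(ipi_vec, ipi_isr);
		test_set_guest(pending_event_guest);

		/* Have L1 intercept external interrupts. */
		vmcs_write(PIN_CONTROLS, vmcs_read(PIN_CONTROLS) | PIN_EXTINT);

		/* Queue an interrupt in L1 while delivery is disallowed... */
		irq_disable();
		apic_icr_write(APIC_DEST_SELF | APIC_DEST_PHYSICAL |
			       APIC_DM_FIXED | ipi_vec, 0);

		/* ...then enter L2. The pending INTR should immediately force
		 * an EXTERNAL_INTERRUPT exit from L2 to L1; without the patch
		 * it is simply lost because L0 never re-evaluates events. */
		enter_guest();
		assert_exit_reason(VMX_EXTINT);

		irq_enable();	/* let the ISR run in L1 */
		irq_disable();
		report(ipi_fired, "pending interrupt delivered after VMEnter");
	}
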
Sean Christopherson Aug. 29, 2018, 6:34 p.m. UTC | #5
On Wed, Aug 29, 2018 at 08:12:43PM +0300, Liran Alon wrote:
> 
> > On 29 Aug 2018, at 19:39, Radim Krčmář <rkrcmar@redhat.com> wrote:
> > 
> > 2018-08-29 09:34-0700, Sean Christopherson:
> >> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> >>> 2018-08-29 18:43+0300, Liran Alon:
> >>>> Consider the case where L1 had a pending event that wasn't delivered
> >>>> before it executed VMLAUNCH/VMRESUME because delivery was disallowed
> >>>> (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> >>>> L0 needs to evaluate whether this pending event should cause an exit
> >>>> from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
> >>>> intercept EXTERNAL_INTERRUPT).
> >>>>
> >>>> Usually this would be handled by L0 requesting a window (e.g. an IRQ
> >>>> window) by setting the VMCS accordingly. However, that setting was
> >>>> done on VMCS01, and now VMCS02 is active instead. Thus, when L1
> >>>> executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
> >>>> evaluation by requesting a KVM_REQ_EVENT.
> >>>>
> >>>> Note that the above scenario occurs when L1 KVM is about to enter L2
> >>>> but requests an "immediate-exit", since in this case L1 disables
> >>>> interrupts and then sends a self-IPI before entering L2.
> >>> 
> >>> Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".
> >> 
> >> Please hold off on doing anything with this, I don't think this is the
> >> correct fix.  I have a half-finished response to the preemption timer
> >> thread that prompted this patch, I'll get that sent ASAP. 
> > 
> > Sure, thanks for the heads-up.
> 
> Sean, I think this is orthogonal to the “immediate-exit” mechanism implementation issue,
> which you suggest replacing with a preemption-timer with an interval of 0 instead of a self-IPI.

I agree.  When I said I didn't think it was the correct fix, I was
thinking that we should propagate the pending interrupt from vmcs01
to vmcs02, but I realized that was wrong after analyzing everything
for the thousandth time.

So, I agree that the general direction is correct, though I think we
can narrow down when we force events to be re-evaluated and also be
more explicit in the reasoning.  And this should also override the
HALT_STATE handling, e.g. injecting into L1 will wake the CPU from
its halt state.  I think the HALT_STATE case was why I saw L2 failing
the preemption unit test.

Hopefully I didn't mangle the patch below...

From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Wed, 29 Aug 2018 11:06:14 -0700
Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter

Force re-evaluation of events prior to running vmcs02 if vmcs01 has
a pending INTR or NMI and vmcs12 is configured to exit on the event,
in which case the event will cause a VMExit to L1 immediately after
VMEnter regardless of L2's event blocking.  Re-evaluating events is
needed to ensure L0 triggers an immediate VMExit from L2 in order to
inject the INTR or NMI into L1.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8dae47e7267a..a5395fc39cb2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	struct vmcs12 *vmcs12;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
-	u32 exit_qual;
+	u32 exit_qual, vmcs01_cpu_exec_ctrl;
 	int ret;
 
 	if (!nested_vmx_check_permission(vcpu))
@@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 
 	/*
 	 * We're finally done with prerequisite checking, and can start with
-	 * the nested entry.
+	 * the nested entry.  Snapshot the CPU-based execution controls from
+	 * vmcs01 before loading vmcs02, we'll need them to properly handle
+	 * post-VMEnter INTR/NMI injection to L1.
 	 */
+	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 
 	vmx->nested.nested_run_pending = 1;
 	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
@@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	nested_cache_shadow_vmcs12(vcpu, vmcs12);
 
 	/*
-	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
-	 * by event injection, halt vcpu.
+	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
+	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
+	 * in which case the event will cause a VMExit to L1 immediately after
+	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
+	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
+	 * inject the INTR or NMI into L1.
 	 */
-	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
+	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
+	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
+	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
+	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
 	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
+		/*
+		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
+		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
+		 * INTR/NMI to L1.
+		 */
 		vmx->nested.nested_run_pending = 0;
 		return kvm_vcpu_halt(vcpu);
 	}
Sean Christopherson Aug. 29, 2018, 6:49 p.m. UTC | #6
On Wed, Aug 29, 2018 at 11:34:52AM -0700, Sean Christopherson wrote:
> On Wed, Aug 29, 2018 at 08:12:43PM +0300, Liran Alon wrote:
> > 
> > > On 29 Aug 2018, at 19:39, Radim Krčmář <rkrcmar@redhat.com> wrote:
> > > 
> > > 2018-08-29 09:34-0700, Sean Christopherson:
> > >> On Wed, Aug 29, 2018 at 06:18:20PM +0200, Radim Krčmář wrote:
> > >>> 2018-08-29 18:43+0300, Liran Alon:
> > >>>> Consider the case where L1 had a pending event that wasn't delivered
> > >>>> before it executed VMLAUNCH/VMRESUME because delivery was disallowed
> > >>>> (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
> > >>>> L0 needs to evaluate whether this pending event should cause an exit
> > >>>> from L2 to L1 or be delivered directly to L2 (in case L1 doesn't
> > >>>> intercept EXTERNAL_INTERRUPT).
> > >>>>
> > >>>> Usually this would be handled by L0 requesting a window (e.g. an IRQ
> > >>>> window) by setting the VMCS accordingly. However, that setting was
> > >>>> done on VMCS01, and now VMCS02 is active instead. Thus, when L1
> > >>>> executes VMLAUNCH/VMRESUME, we force L0 to perform pending event
> > >>>> evaluation by requesting a KVM_REQ_EVENT.
> > >>>>
> > >>>> Note that the above scenario occurs when L1 KVM is about to enter L2
> > >>>> but requests an "immediate-exit", since in this case L1 disables
> > >>>> interrupts and then sends a self-IPI before entering L2.
> > >>> 
> > >>> Which makes it a big blunder, I'll add "Cc: stable@vger.kernel.org".
> > >> 
> > >> Please hold off on doing anything with this, I don't think this is the
> > >> correct fix.  I have a half-finished response to the preemption timer
> > >> thread that prompted this patch, I'll get that sent ASAP. 
> > > 
> > > Sure, thanks for the heads-up.
> > 
> > Sean, I think this is orthogonal to the “immediate-exit” mechanism implementation issue,
> > which you suggest replacing with a preemption-timer with an interval of 0 instead of a self-IPI.
> 
> I agree.  When I said I didn't think it was the correct fix, I was
> thinking that we should propagate the pending interrupt from vmcs01
> to vmcs02, but I realized that was wrong after analyzing everything
> for the thousandth time.
> 
> So, I agree that the general direction is correct, though I think we
> can narrow down when we force events to be re-evaluated and also be
> more explicit in the reasoning.  And this should also override the
> HALT_STATE handling, e.g. injecting into L1 will wake the CPU from
> its halt state.  I think the HALT_STATE case was why I saw L2 failing
> the preemption unit test.
> 
> Hopefully I didn't mangle the patch below...
> 
> From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> Date: Wed, 29 Aug 2018 11:06:14 -0700
> Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter
> 
> Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> a pending INTR or NMI and vmcs12 is configured to exit on the event,
> in which case the event will cause a VMExit to L1 immediately after
> VMEnter regardless of L2's event blocking.  Re-evaluating events is
> needed to ensure L0 triggers an immediate VMExit from L2 in order to
> inject the INTR or NMI into L1.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Doh, this obviously should have a Suggested-by or Reported-by for Liran.

> ---
>  arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 8dae47e7267a..a5395fc39cb2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	struct vmcs12 *vmcs12;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> -	u32 exit_qual;
> +	u32 exit_qual, vmcs01_cpu_exec_ctrl;
>  	int ret;
>  
>  	if (!nested_vmx_check_permission(vcpu))
> @@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  
>  	/*
>  	 * We're finally done with prerequisite checking, and can start with
> -	 * the nested entry.
> +	 * the nested entry.  Snapshot the CPU-based execution controls from
> +	 * vmcs01 before loading vmcs02, we'll need them to properly handle
> +	 * post-VMEnter INTR/NMI injection to L1.
>  	 */
> +	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  
>  	vmx->nested.nested_run_pending = 1;
>  	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
> @@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	nested_cache_shadow_vmcs12(vcpu, vmcs12);
>  
>  	/*
> -	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> -	 * by event injection, halt vcpu.
> +	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> +	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
> +	 * in which case the event will cause a VMExit to L1 immediately after
> +	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
> +	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
> +	 * inject the INTR or NMI into L1.
>  	 */
> -	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> +	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
> +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
> +	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
> +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
> +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
>  	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
> +		/*
> +		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
> +		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
> +		 * INTR/NMI to L1.
> +		 */
>  		vmx->nested.nested_run_pending = 0;
>  		return kvm_vcpu_halt(vcpu);
>  	}
> -- 
> 2.18.0
> 
> > In my opinion, this patch handles a general issue of losing a pending interrupt that was queued
> > (and disallowed from being dispatched) in L1 before entering L2. This is not just related
> > to the immediate-exit mechanism. It is also true, for example, for a timer interrupt that may be
> > raised in L1 during the timespan in which L1 disables interrupts until it VMRESUMEs into L2.
> > 
> > I have actually written a small but effective kvm-unit-test for this. It fails before this patch and passes after it.
> > I will submit the unit-test and Cc you guys.
> > 
> > -Liran
> > 
> >
Liran Alon Aug. 30, 2018, 12:24 a.m. UTC | #7
> On 29 Aug 2018, at 19:34, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> I agree.  When I said I didn't think it was the correct fix, I was
> thinking that we should propagate the pending interrupt from vmcs01
> to vmcs02, but I realized that was wrong after analyzing everything
> for the thousandth time.
> 
> So, I agree that the general direction is correct, though I think we
> can narrow down when we force events to be re-evaluated and also be
> more explicit in the reasoning.  And this should also override the
> HALT_STATE handling, e.g. injecting into L1 will wake the CPU from
> its halt state.  I think the HALT_STATE case was why I saw L2 failing
> the preemption unit test.
> 
> Hopefully I didn't mangle the patch below...
> 
> From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
> 
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Date: Wed, 29 Aug 2018 11:06:14 -0700
> Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter
> 
> Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> a pending INTR or NMI and vmcs12 is configured to exit on the event,
> in which case the event will cause a VMExit to L1 immediately after
> VMEnter regardless of L2's event blocking.  Re-evaluating events is
> needed to ensure L0 triggers an immediate VMExit from L2 in order to
> inject the INTR or NMI into L1.
> 
> Cc: stable@vger.kernel.org
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> ---
>  arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 8dae47e7267a..a5395fc39cb2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	struct vmcs12 *vmcs12;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> -	u32 exit_qual;
> +	u32 exit_qual, vmcs01_cpu_exec_ctrl;
>  	int ret;
>  
>  	if (!nested_vmx_check_permission(vcpu))
> @@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  
>  	/*
>  	 * We're finally done with prerequisite checking, and can start with
> -	 * the nested entry.
> +	 * the nested entry.  Snapshot the CPU-based execution controls from
> +	 * vmcs01 before loading vmcs02, we'll need them to properly handle
> +	 * post-VMEnter INTR/NMI injection to L1.
>  	 */
> +	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  
>  	vmx->nested.nested_run_pending = 1;
>  	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
> @@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	nested_cache_shadow_vmcs12(vcpu, vmcs12);
>  
>  	/*
> -	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> -	 * by event injection, halt vcpu.
> +	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> +	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
> +	 * in which case the event will cause a VMExit to L1 immediately after
> +	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
> +	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
> +	 * inject the INTR or NMI into L1.
>  	 */
> -	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> +	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
> +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
> +	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
> +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
> +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
>  	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
> +		/*
> +		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
> +		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
> +		 * INTR/NMI to L1.
> +		 */
>  		vmx->nested.nested_run_pending = 0;
>  		return kvm_vcpu_halt(vcpu);
>  	}
> 

(Sorry for opening a new email thread. Something screwed up in my email client…)

I see a couple of issues with the patch you suggest:

1. When L1 doesn't intercept external interrupts, the interrupt should still be injected after VMLAUNCH/VMRESUME, but directly into L2.
Therefore, we should remove the tests against the vmcs12->pin_based_vm_exec_control bits.
(KVM_REQ_EVENT will handle this direct injection into L2 correctly; see the sketch after this list.)

2. The logic that decides whether we should set KVM_REQ_EVENT should be inside enter_vmx_non_root_mode() and not nested_vmx_run().
Otherwise, you could reach a buggy scenario after nVMX migration, which sets nVMX state directly from vmx_set_nested_state().
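
For reference, the reason KVM_REQ_EVENT covers both cases is how it is
consumed on the next entry; roughly (paraphrased from x86.c's
vcpu_enter_guest() of this era, not verbatim):

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		/* ... */
		if (inject_pending_event(vcpu, req_int_win) != 0)
			req_immediate_exit = true;
		/* ... else open an IRQ/NMI window as needed ... */
	}

inject_pending_event() begins by calling kvm_x86_ops->check_nested_events(),
which synthesizes the EXTERNAL_INTERRUPT VMExit from L2 to L1 when vmcs12
intercepts external interrupts; otherwise the pending interrupt is injected
directly into L2 through VMCS02.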

As I agree with your other comments, I will create a v2 of my patch and submit it. :)
(Together with the relevant kvm-unit-test)

Thanks,
-Liran
Sean Christopherson Aug. 30, 2018, 12:54 p.m. UTC | #8
On Thu, 2018-08-30 at 03:24 +0300, Liran Alon wrote:
> 
> > 
> > On 29 Aug 2018, at 19:34, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > I agree.  When I said I didn't think it was the correct fix, I was
> > thinking that we should propagate the pending interrupt from vmcs01
> > to vmcs02, but I realized that was wrong after analyzing everything
> > for the thousandth time.
> > 
> > So, I agree that the general direction is correct, though I think we
> > can narrow down when we force events to be re-evaluated and also be
> > more explicit in the reasoning.  And this should also override the
> > HALT_STATE handling, e.g. injecting into L1 will wake the CPU from
> > its halt state.  I think the HALT_STATE case was why I saw L2 failing
> > the preemption unit test.
> > 
> > Hopefully I didn't mangle the patch below...
> > 
> > From 69617481cb5d8813046d32d2b6881e97b88a746e Mon Sep 17 00:00:00 2001
> > 
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Date: Wed, 29 Aug 2018 11:06:14 -0700
> > Subject: [PATCH] KVM: nVMX: re-evaluate events if L1 should get an INTR/NMI after VMEnter
> > 
> > Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> > a pending INTR or NMI and vmcs12 is configured to exit on the event,
> > in which case the event will cause a VMExit to L1 immediately after
> > VMEnter regardless of L2's event blocking.  Re-evaluating events is
> > needed to ensure L0 triggers an immediate VMExit from L2 in order to
> > inject the INTR or NMI into L1.
> > 
> > Cc: stable@vger.kernel.org
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > ---
> >  arch/x86/kvm/vmx.c | 27 ++++++++++++++++++++++-----
> >  1 file changed, 22 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 8dae47e7267a..a5395fc39cb2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12602,7 +12602,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  	struct vmcs12 *vmcs12;
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> > -	u32 exit_qual;
> > +	u32 exit_qual, vmcs01_cpu_exec_ctrl;
> >  	int ret;
> >  
> >  	if (!nested_vmx_check_permission(vcpu))
> > @@ -12674,8 +12674,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  
> >  	/*
> >  	 * We're finally done with prerequisite checking, and can start with
> > -	 * the nested entry.
> > +	 * the nested entry.  Snapshot the CPU-based execution controls from
> > +	 * vmcs01 before loading vmcs02, we'll need them to properly handle
> > +	 * post-VMEnter INTR/NMI injection to L1.
> >  	 */
> > +	vmcs01_cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> >  
> >  	vmx->nested.nested_run_pending = 1;
> >  	ret = enter_vmx_non_root_mode(vcpu, &exit_qual);
> > @@ -12701,11 +12704,25 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  	nested_cache_shadow_vmcs12(vcpu, vmcs12);
> >  
> >  	/*
> > -	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> > -	 * by event injection, halt vcpu.
> > +	 * Force re-evaluation of events prior to running vmcs02 if vmcs01 has
> > +	 * a pending INTR or NMI and vmcs12 is configured to exit on the event,
> > +	 * in which case the event will cause a VMExit to L1 immediately after
> > +	 * VMEnter regardless of L2's event blocking.  Re-evaluating events is
> > +	 * needed to ensure L0 triggers an immediate VMExit from L2 in order to
> > +	 * inject the INTR or NMI into L1.
> >  	 */
> > -	if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> > +	if (((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) &&
> > +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_EXT_INTR_MASK)) ||
> > +	    ((vmcs01_cpu_exec_ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) &&
> > +	     (vmcs12->pin_based_vm_exec_control & PIN_BASED_NMI_EXITING))) {
> > +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +	} else if ((vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT) &&
> >  	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
> > +		/*
> > +		 * Halt the vCPU if we're entering a halted L2 vCPU and the L2
> > +		 * vCPU won't be woken by an injected event, e.g. VOE to L2 or
> > +		 * INTR/NMI to L1.
> > +		 */
> >  		vmx->nested.nested_run_pending = 0;
> >  		return kvm_vcpu_halt(vcpu);
> >  	}
> > 
> (Sorry for opening a new email thread. Something screwed up in my email client…)
> 
> I see a couple of issues with the patch you suggest:
> 
> 1. When L1 doesn't intercept external interrupts, the interrupt should still be injected after VMLAUNCH/VMRESUME, but directly into L2.
> Therefore, we should remove the tests against the vmcs12->pin_based_vm_exec_control bits.
> (KVM_REQ_EVENT will handle this direct injection into L2 correctly.)

Right, not sure why I brought vmcs12 into the picture.  The behavior
we're enforcing is the VMExit to L0; everything after that is handled
by whatever L0 normally does after the associated VMExit.

> 2. The logic that decides whether we should set KVM_REQ_EVENT should be inside enter_vmx_non_root_mode() and not nested_vmx_run().
> Otherwise, you could reach a buggy scenario after nVMX migration, which sets nVMX state directly from vmx_set_nested_state().

In that case I think we'd want to check nested_run_pending, i.e. entry
from SMM shouldn't set KVM_REQ_EVENT.  If nested_run_pending==false and
one of the pending bits is set (in the migration case), we already goofed.

> As I agree with your other comments, I will create a v2 of my patch and submit it. :)
> (Together with the relevant kvm-unit-test)
> 
> Thanks,
> -Liran
> 
>
Sean Christopherson Aug. 30, 2018, 1:03 p.m. UTC | #9
On Thu, 2018-08-30 at 05:54 -0700, Sean Christopherson wrote:
> On Thu, 2018-08-30 at 03:24 +0300, Liran Alon wrote:
> > 
> > (Sorry for opening a new email thread. Something screwed up in my email client…)
> > 
> > I see a couple of issues with the patch you suggest:
> > 
> > 1. When L1 doesn't intercept external interrupts, the interrupt should still be injected after VMLAUNCH/VMRESUME, but directly into L2.
> > Therefore, we should remove the tests against the vmcs12->pin_based_vm_exec_control bits.
> > (KVM_REQ_EVENT will handle this direct injection into L2 correctly.)
> Right, not sure why I brought vmcs12 into the picture.  The behavior
> we're enforcing is the VMExit to L0; everything after that is handled
> by whatever L0 normally does after the associated VMExit.
> 
> > 
> > 2. The logic that decides whether we should set KVM_REQ_EVENT should be inside enter_vmx_non_root_mode() and not nested_vmx_run().
> > Otherwise, you could reach a buggy scenario after nVMX migration, which sets nVMX state directly from vmx_set_nested_state().
> In that case I think we'd want to check nested_run_pending, i.e. entry
> from SMM shouldn't set KVM_REQ_EVENT.  If nested_run_pending==false and
> one of the pending bits is set (in the migration case), we already goofed.

Actually, ignore this.  SMI has higher priority so we could be
resuming to the first instruction of the guest, at which time
the pending INTR/NMI would be taken.

> > 
> > As I agree with your other comments, I will create a v2 of my patch and submit it. :)
> > (Together with the relevant kvm-unit-test)
> > 
> > Thanks,
> > -Liran
> > 
> >

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1d26f3c4985b..e4eeff1f4c58 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -12575,6 +12575,22 @@  static int enter_vmx_non_root_mode(struct kvm_vcpu *vcpu, u32 *exit_qual)
 	}
 
 	/*
+	 * If L1 had a pending event that wasn't delivered before it
+	 * executed VMLAUNCH/VMRESUME because delivery was disallowed
+	 * (e.g. interrupts disabled), L0 needs to evaluate whether
+	 * this pending event should cause an exit from L2 to L1 or be
+	 * delivered directly to L2 (in case L1 doesn't intercept
+	 * EXTERNAL_INTERRUPT).
+	 *
+	 * Usually this would be handled by L0 requesting a window
+	 * (e.g. an IRQ window) by setting the VMCS accordingly.
+	 * However, that setting was done on VMCS01, and now VMCS02 is
+	 * active instead. Thus, we force L0 to perform pending event
+	 * evaluation by requesting a KVM_REQ_EVENT.
+	 */
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+	/*
 	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
 	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
 	 * returned as far as L1 is concerned. It will only return (and set