diff mbox series

[6/6] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection

Message ID 20240720000138.3027780-7-seanjc@google.com (mailing list archive)
State New, archived
Series [1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site

Commit Message

Sean Christopherson July 20, 2024, 12:01 a.m. UTC
When synthesizing a nested VM-Exit due to an external interrupt, pend a
nested posted interrupt if the external interrupt vector matches L2's PI
notification vector, i.e. if the interrupt is a PI notification for L2.
This fixes a bug where KVM will incorrectly inject VM-Exit instead of
processing nested posted interrupt when IPI virtualization is enabled.

Per the SDM, detection of the notification vector doesn't occur until the
interrupt is acknowledged and delivered to the CPU core.

  If the external-interrupt exiting VM-execution control is 1, any unmasked
  external interrupt causes a VM exit (see Section 26.2). If the "process
  posted interrupts" VM-execution control is also 1, this behavior is
  changed and the processor handles an external interrupt as follows:

    1. The local APIC is acknowledged; this provides the processor core
       with an interrupt vector, called here the physical vector.
    2. If the physical vector equals the posted-interrupt notification
       vector, the logical processor continues to the next step. Otherwise,
       a VM exit occurs as it would normally due to an external interrupt;
       the vector is saved in the VM-exit interruption-information field.
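
The two quoted steps reduce to a single comparison made after the interrupt has been acknowledged. The following is a minimal sketch of that decision, not KVM or hardware code; the enum, the function name, and the parameter names are hypothetical, and "external-interrupt exiting" is assumed to be 1 as in the quoted text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome names, used only for this illustration. */
enum ack_outcome {
	ACK_VM_EXIT,        /* normal external-interrupt VM exit */
	ACK_PROCESS_POSTED, /* continue to posted-interrupt processing */
};

/*
 * Model of steps 1-2 above, assuming "external-interrupt exiting" is 1.
 * 'physical_vector' is the vector obtained by acknowledging the local APIC.
 */
static enum ack_outcome classify_acked_vector(bool process_posted_intr,
					      int posted_intr_nv,
					      int physical_vector)
{
	assert(physical_vector >= 0 && physical_vector <= 255);

	/* The notification-vector check happens only after the ACK. */
	if (process_posted_intr && physical_vector == posted_intr_nv)
		return ACK_PROCESS_POSTED;

	return ACK_VM_EXIT;
}
```

The point of the sketch is the ordering: the vector is compared against the notification vector only once it is known, which is why the check belongs at the site where the interrupt is acknowledged.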

For the most part, KVM has avoided problems because a PI NV for L2 that
arrives while L2 is active will be processed by hardware, and KVM checks
for a pending notification vector during nested VM-Enter.  Thus, to hit
the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while
L2 is active.

Without IPI virtualization, the scenario is practically impossible to hit
as the ordering between vmx_deliver_posted_interrupt() and nested VM-Enter
effectively guarantees that either the sender will see the vCPU as being
in_guest_mode(), or the receiver will see the interrupt in its vIRR.

With IPI virtualization, the sending CPU effectively implements a rough
equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check.
If the target vCPU has a valid PID, the CPU will send a PI NV interrupt
based on _L1's_ PID, as the sender's IPIv table points at L1 PIDs.

  PIR := 32 bytes at PID_ADDR;
  // under lock
  PIR[V] := 1;
  store PIR at PID_ADDR;
  // release lock

  NotifyInfo := 8 bytes at PID_ADDR + 32;
  // under lock
  IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN
    NotifyInfo.ON := 1;
    SendNotify := 1;
  ELSE
    SendNotify := 0;
  FI;
  store NotifyInfo at PID_ADDR + 32;
  // release lock

  IF SendNotify = 1; THEN
    send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV;
  FI;
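
For readers who prefer C to the SDM pseudocode, the same sequence can be modeled as below. This is a simplified, single-threaded sketch: the struct layout and the names pid_model and pid_post_vector are hypothetical, and the locked read-modify-write that hardware performs is reduced to plain stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a posted-interrupt descriptor (not the real layout). */
struct pid_model {
	uint32_t pir[8];  /* 256-bit posted-interrupt request bitmap */
	bool on;          /* outstanding notification */
	bool sn;          /* suppress notification */
	uint8_t nv;       /* notification vector */
	uint32_t ndst;    /* notification destination */
};

/*
 * Post 'vector' into the descriptor.  Returns true if the caller should
 * send a notification IPI described by (ndst, nv), false otherwise.
 */
static bool pid_post_vector(struct pid_model *pid, unsigned int vector)
{
	assert(vector < 256);

	/* PIR[V] := 1 */
	pid->pir[vector / 32] |= UINT32_C(1) << (vector % 32);

	/* Notify only if no notification is outstanding or suppressed. */
	if (!pid->on && !pid->sn) {
		pid->on = true;
		return true;
	}
	return false;
}
```

Note that nothing in this sequence looks at which guest level is running on the target: the IPI always carries the NV stored in the descriptor that was written.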

As a result, the target vCPU ends up receiving an interrupt on KVM's
POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for
L2's nested PI NV.  The POSTED_INTR_VECTOR interrupt triggers a VM-Exit
from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a
KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(),
effectively bypassing all of KVM's "early" checks on nested PI NV.

Note, the Fixes tag is a bit of a lie, as the bug is technically a generic
nested posted interrupt issue.  However, as above, it's practically
impossible to hit the bug without IPI virtualization being enabled.

Cc: Chao Gao <chao.gao@intel.com>
Cc: Zeng Guang <guang.zeng@intel.com>
Cc: stable@vger.kernel.org
Fixes: d588bb9be1da ("KVM: VMX: enable IPI virtualization")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Chao Gao July 23, 2024, 2:49 p.m. UTC | #1
On Fri, Jul 19, 2024 at 05:01:38PM -0700, Sean Christopherson wrote:
>When synthesizing a nested VM-Exit due to an external interrupt, pend a
>nested posted interrupt if the external interrupt vector matches L2's PI
>notification vector, i.e. if the interrupt is a PI notification for L2.
>This fixes a bug where KVM will incorrectly inject VM-Exit instead of
>processing nested posted interrupt when IPI virtualization is enabled.
>
>Per the SDM, detection of the notification vector doesn't occur until the
>interrupt is acknowledged and delivered to the CPU core.
>
>  If the external-interrupt exiting VM-execution control is 1, any unmasked
>  external interrupt causes a VM exit (see Section 26.2). If the "process
>  posted interrupts" VM-execution control is also 1, this behavior is
>  changed and the processor handles an external interrupt as follows:
>
>    1. The local APIC is acknowledged; this provides the processor core
>       with an interrupt vector, called here the physical vector.
>    2. If the physical vector equals the posted-interrupt notification
>       vector, the logical processor continues to the next step. Otherwise,
>       a VM exit occurs as it would normally due to an external interrupt;
>       the vector is saved in the VM-exit interruption-information field.
>
>For the most part, KVM has avoided problems because a PI NV for L2 that
>arrives while L2 is active will be processed by hardware, and KVM checks
>for a pending notification vector during nested VM-Enter.

With this series in place, I wonder if we can remove the check for a pending
notification vector during nested VM-Enter.

	/* Emulate processing of posted interrupts on VM-Enter. */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
	}

I believe the check is arguably incorrect because:

1. nested_vmx_run() may set pi_pending and clear the IRR bit of the notification
vector, but this doesn't guarantee that vmx_complete_nested_posted_interrupt()
will be called later in vmx_check_nested_events(). This could lead to partial
posted interrupt processing, where the IRR bit is cleared but PIR isn't copied
into vIRR. This might confuse L1 since, from L1's perspective, posted interrupt
processing should be atomic. Per the SDM, the logical processor performs
posted-interrupt processing "in an uninterruptible manner".

2. The check doesn't respect event priority. For example, if a higher-priority
event (preemption timer exit or NMI-window exit) causes an immediate nested
VM-exit, the notification vector should remain pending after the nested VM-exit.

Sean Christopherson July 23, 2024, 5:43 p.m. UTC | #2
On Tue, Jul 23, 2024, Chao Gao wrote:
> On Fri, Jul 19, 2024 at 05:01:38PM -0700, Sean Christopherson wrote:
> >When synthesizing a nested VM-Exit due to an external interrupt, pend a
> >nested posted interrupt if the external interrupt vector matches L2's PI
> >notification vector, i.e. if the interrupt is a PI notification for L2.
> >This fixes a bug where KVM will incorrectly inject VM-Exit instead of
> >processing nested posted interrupt when IPI virtualization is enabled.
> >
> >Per the SDM, detection of the notification vector doesn't occur until the
> >interrupt is acknowledged and delivered to the CPU core.
> >
> >  If the external-interrupt exiting VM-execution control is 1, any unmasked
> >  external interrupt causes a VM exit (see Section 26.2). If the "process
> >  posted interrupts" VM-execution control is also 1, this behavior is
> >  changed and the processor handles an external interrupt as follows:
> >
> >    1. The local APIC is acknowledged; this provides the processor core
> >       with an interrupt vector, called here the physical vector.
> >    2. If the physical vector equals the posted-interrupt notification
> >       vector, the logical processor continues to the next step. Otherwise,
> >       a VM exit occurs as it would normally due to an external interrupt;
> >       the vector is saved in the VM-exit interruption-information field.
> >
> >For the most part, KVM has avoided problems because a PI NV for L2 that
> >arrives while L2 is active will be processed by hardware, and KVM checks
> >for a pending notification vector during nested VM-Enter.
> 
> With this series in place, I wonder if we can remove the check for a pending
> notification vector during nested VM-Enter.
> 
> 	/* Emulate processing of posted interrupts on VM-Enter. */
> 	if (nested_cpu_has_posted_intr(vmcs12) &&
> 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
> 		vmx->nested.pi_pending = true;
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
> 	}
> 
> I believe the check is arguably incorrect because:
> 
> 1. nested_vmx_run() may set pi_pending and clear the IRR bit of the notification
> vector, but this doesn't guarantee that vmx_complete_nested_posted_interrupt()
> will be called later in vmx_check_nested_events(). This could lead to partial
> posted interrupt processing, where the IRR bit is cleared but PIR isn't copied
> into vIRR. This might confuse L1 since, from L1's perspective, posted interrupt
> processing should be atomic. Per the SDM, the logical processor performs
> posted-interrupt processing "in an uninterruptible manner".

vmx_deliver_nested_posted_interrupt() is also broken in this regard.  I don't see
a sane way to handle that though, at least not without completely losing the value
of posted interrupts.  Ooh, maybe we could call vmx_complete_nested_posted_interrupt()
from nested_vmx_vmexit()?  That is a little scary, but probably worth trying?

> 2. The check doesn't respect event priority. For example, if a higher-priority
> event (preemption timer exit or NMI-window exit) causes an immediate nested
> VM-exit, the notification vector should remain pending after the nested VM-exit.

Ah, right, because block_nested_events would be true due to the pending nested
VM-Enter, which would ensure KVM enters L2 and trips NMI/IRQ window exiting.

The downside is that removing that code would regress performance for the more
common case (no NMI/IRQ window), as KVM would need to complete the nested
VM-Enter before consuming the IRQ, i.e. would need to do a VM-Enter and force a
VM-Exit.  But as you say, that's the architecturally correct behavior.
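
The partial-processing window from point 1 is easier to see in a toy model that splits posted-interrupt handling into the two steps discussed in this thread. This is only an illustrative sketch, not KVM's implementation: the struct, its fields, and the model_* function names are hypothetical, and all locking and atomic operations are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VECTORS 256
#define NR_WORDS   (NR_VECTORS / 32)

/* Hypothetical, heavily simplified state for one vCPU running L2. */
struct nested_pi_model {
	uint32_t l1_irr[NR_WORDS];   /* L1's virtual IRR */
	uint32_t l2_pir[NR_WORDS];   /* PIR in L2's posted-interrupt descriptor */
	uint32_t l2_virr[NR_WORDS];  /* L2's virtual IRR */
	bool l2_on;                  /* ON bit in L2's descriptor */
	bool pi_pending;             /* deferred "finish processing" flag */
	unsigned int nv;             /* L2's notification vector */
};

/* Step 1: consume the notification vector from L1's IRR and defer the rest. */
static bool model_detect_nv(struct nested_pi_model *m)
{
	uint32_t mask;

	assert(m->nv < NR_VECTORS);
	mask = UINT32_C(1) << (m->nv % 32);

	if (!(m->l1_irr[m->nv / 32] & mask))
		return false;

	m->l1_irr[m->nv / 32] &= ~mask;
	m->pi_pending = true;
	return true;
}

/* Step 2: clear ON and merge the PIR into L2's vIRR. */
static void model_complete(struct nested_pi_model *m)
{
	if (!m->pi_pending)
		return;
	m->pi_pending = false;

	if (!m->l2_on)
		return;
	m->l2_on = false;

	for (int i = 0; i < NR_WORDS; i++) {
		m->l2_virr[i] |= m->l2_pir[i];
		m->l2_pir[i] = 0;
	}
}
```

In this model, any state observed between model_detect_nv() and model_complete() has the notification vector consumed but the PIR not yet merged, which is the non-atomic window described above.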

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 40cf4839ca47..f1fe4d5a1ed8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4296,10 +4296,21 @@  static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		if (nested_exit_intr_ack_set(vcpu)) {
 			int irq;
 
-			irq = kvm_cpu_get_interrupt(vcpu, -1);
+			irq = kvm_cpu_get_interrupt(vcpu, vmx->nested.posted_intr_nv);
 			if (WARN_ON_ONCE(irq < 0))
 				goto no_vmexit;
 
+			/*
+			 * If the IRQ is L2's PI notification vector, process
+			 * posted interrupts instead of injecting VM-Exit, as
+			 * the detection/morphing architecturally occurs when
+			 * the IRQ is delivered to the CPU.  Note, enabling PI
+			 * requires ACK-on-exit.
+			 */
+			if (irq == vmx->nested.posted_intr_nv) {
+				vmx->nested.pi_pending = true;
+				goto no_vmexit;
+			}
 			exit_intr_info = INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq;
 		} else {
 			exit_intr_info = 0;