
[2/4] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending

Message ID 1509890866-8736-3-git-send-email-liran.alon@oracle.com (mailing list archive)
State New, archived

Commit Message

Liran Alon Nov. 5, 2017, 2:07 p.m. UTC
When L2 exits to L0 during event delivery, VMCS02 is filled with
IDT-vectoring info, which vmx_complete_interrupts() makes sure to
reinject before the next resume of L2.

While L0 handles the VMExit, another L1 vCPU could send an IPI to the
L1 vCPU which is currently running L2 and has exited to L0.

When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
it notes that a previous event was reinjected to L2 (via
IDT-vectoring info) and therefore doesn't check whether there are
pending L1 events which require an exit from L2 to L1. Thus, L0 enters
L2 without an immediate VMExit even though there are pending L1 events!

This commit fixes the issue by making sure to check for pending L1
events even if a previous event was reinjected to L2, and by bailing
out of inject_pending_event() before evaluating a new pending event
in case an event was already reinjected.

The bug was observed by the following setup:
* L0 is a 64CPU machine which runs KVM.
* L1 is a 16CPU machine which runs KVM.
* L0 & L1 run with APICv disabled.
(Also reproduced with APICv enabled, but the info below is easier to
analyze with APICv disabled.)
* L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely. Analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and waiting forever on an IPI
it has sent to another L1 vCPU, while all other L1 vCPUs are
attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel preemption disabled).

Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15

Which can be analyzed as follows:
(1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
a pending-interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
(3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(),
which notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns, without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without an immediate exit even though there is a
pending L1 event (the IPI pending in the LAPIC's IRR).

We have already reached the buggy scenario, but the events can be
analyzed further:
(5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/x86.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

Comments

Paolo Bonzini Nov. 10, 2017, 11:26 p.m. UTC | #1
On 05/11/2017 15:07, Liran Alon wrote:
> +	/*
> +	 * If we reinjected a previous event,
> +	 * don't consider a new pending event
> +	 */
> +	if (kvm_event_needs_reinjection(vcpu))
> +		return 0;
> +

Could you end up with

                        WARN_ON(vcpu->arch.exception.pending);

in vcpu_enter_guest after returning 0 here?

Maybe it would be safer to return a non-zero value so that the caller
sets req_immediate_exit = true.  But I haven't really thought through
the consequences.

Thanks,

Paolo
Liran Alon Nov. 11, 2017, 1:44 a.m. UTC | #2
On 11/11/17 01:26, Paolo Bonzini wrote:
> On 05/11/2017 15:07, Liran Alon wrote:
>> +	/*
>> +	 * If we reinjected a previous event,
>> +	 * don't consider a new pending event
>> +	 */
>> +	if (kvm_event_needs_reinjection(vcpu))
>> +		return 0;
>> +
>
> Could you end up with
>
>                          WARN_ON(vcpu->arch.exception.pending);
>
> in vcpu_enter_guest after returning 0 here?
>
> Maybe it would be safer to return a non-zero value so that the caller
> sets req_immediate_exit = true.  But I haven't really thought through
> the consequences.

The only difference before and after this patch *should* have been that 
now, if L1 has a pending event (as specified by vmx_check_nested_events()), 
a value of -EBUSY is returned and an immediate exit is requested, even 
if a reinjection has occurred.
If that is not the case, the previous code and this code should return 0 
in exactly the same cases.

*However*, if exception.pending=true and 
nmi_injected/interrupt.pending=true, then the previous code would have 
continued through inject_pending_event() while this code returns too 
soon, indeed triggering the above-mentioned warning.

Therefore I think you have found a bug here that was missed in the 
review chain for some reason and wasn't observed in tests... Will 
investigate how the tests didn't catch this. Sorry about that.
It seems that this patch's approach would have worked on a version 
before commit 664f8e26b00c ("KVM: X86: Fix loss of exception which has 
not yet been injected") but breaks afterwards...

Will fix and re-run all tests.

Thanks,

-Liran

>
> Thanks,
>
> Paolo
>
Paolo Bonzini Nov. 13, 2017, 8:38 a.m. UTC | #3
On 11/11/2017 02:44, Liran Alon wrote:
> 
> Therefore I think you have found a bug here that was missed in the
> review chain for some reason and wasn't observed in tests... Will
> investigate how the tests didn't catch this. Sorry about that.

Probably because the exception.pending case is very rare right now.  It
will become more common however if I can get rid of
kvm_inject_page_fault_nested.  This more or less also answers your
question as to why this patch is related to the removal of
kvm_inject_page_fault_nested.

Thanks,

Paolo

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb7fcd6..41f199287edf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6379,33 +6379,38 @@  static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 	int r;
 
 	/* try to reinject previous events if any */
-	if (vcpu->arch.exception.injected) {
+	if (vcpu->arch.exception.injected)
 		kvm_x86_ops->queue_exception(vcpu);
-		return 0;
-	}
-
 	/*
 	 * Exceptions must be injected immediately, or the exception
 	 * frame will have the address of the NMI or interrupt handler.
 	 */
-	if (!vcpu->arch.exception.pending) {
-		if (vcpu->arch.nmi_injected) {
+	else if (!vcpu->arch.exception.pending) {
+		if (vcpu->arch.nmi_injected)
 			kvm_x86_ops->set_nmi(vcpu);
-			return 0;
-		}
-
-		if (vcpu->arch.interrupt.pending) {
+		else if (vcpu->arch.interrupt.pending)
 			kvm_x86_ops->set_irq(vcpu);
-			return 0;
-		}
 	}
 
+	/*
+	 * Call check_nested_events() even if we reinjected a previous event
+	 * in order for caller to determine if it should require immediate-exit
+	 * from L2 to L1 due to pending L1 events which require exit
+	 * from L2 to L1.
+	 */
 	if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
 		r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
 		if (r != 0)
 			return r;
 	}
 
+	/*
+	 * If we reinjected a previous event,
+	 * don't consider a new pending event
+	 */
+	if (kvm_event_needs_reinjection(vcpu))
+		return 0;
+
 	/* try to inject new event if pending */
 	if (vcpu->arch.exception.pending) {
 		trace_kvm_inj_exception(vcpu->arch.exception.nr,