| Message ID | 1511278211-12257-8-git-send-email-liran.alon@oracle.com (mailing list archive) |
|---|---|
| State | New, archived |
I am concerned about the possible conflation of L1 and L2 events. Vmx_check_nested_events() reads as though vcpu->arch.exception.pending is always associated with an L1 event, yet inject_pending_event() reads as though vcpu->arch.exception.injected is associated with the current level. That seems odd, to split this structure that way. Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual interrupt is interrupted by another event. But vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an L1 "physical" interrupt should cause a VM-exit from L2 to L1, and kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when !lapic_in_kernel(). Clearly, there is some disagreement here over what vcpu->arch.interrupt.pending means. I think there may be some more fundamental problems lurking here. On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: > In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with > IDT-vectoring-info which vmx_complete_interrupts() makes sure to > reinject before next resume of L2. > > While handling the VMExit in L0, an IPI could be sent by another L1 vCPU > to the L1 vCPU which currently runs L2 and exited to L0. > > When L0 will reach vcpu_enter_guest() and call inject_pending_event(), > it will note that a previous event was re-injected to L2 (by > IDT-vectoring-info) and therefore won't check if there are pending L1 > events which require exit from L2 to L1. Thus, L0 enters L2 without > immediate VMExit even though there are pending L1 events! > > This commit fixes the issue by making sure to check for L1 pending > events even if a previous event was reinjected to L2 and bailing out > from inject_pending_event() before evaluating a new pending event in > case an event was already reinjected. > > The bug was observed by the following setup: > * L0 is a 64CPU machine which runs KVM. > * L1 is a 16CPU machine which runs KVM. > * L0 & L1 runs with APICv disabled. > (Also reproduced with APICv enabled but easier to analyze below info > with APICv disabled) > * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. > During L2 boot, L1 hangs completely and analyzing the hang reveals that > one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI > that he has sent for another L1 vCPU. And all other L1 vCPUs are > currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck > forever (as L1 runs with kernel-preemption disabled). 
> > Observing /sys/kernel/debug/tracing/trace_pipe reveals the following > series of events: > (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: > 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 > ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 > (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f > vec 252 (Fixed|edge) > (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 > (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 > (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION > rip 0xffffe00069202690 info 83 0 > (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: > 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 > ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 > (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: > EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 > ext_int: 0x00000000 ext_int_err: 0x00000000 > (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 > > Which can be analyzed as follows: > (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. > Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject > a pending-interrupt of 0xd2. > (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination > vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. > (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which > notes that interrupt 0xd2 was reinjected and therefore calls > vmx_inject_irq() and returns. Without checking for pending L1 events! > Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() > before calling inject_pending_event(). > (4) L0 resumes L2 without immediate-exit even though there is a pending > L1 event (The IPI pending in LAPIC's IRR). > > We have already reached the buggy scenario but events could be > furthered analyzed: > (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during > event-delivery. > (7) L0 decides to forward the VMExit to L1 for further handling. > (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the > LAPIC's IRR is not examined and therefore the IPI is still not delivered > into L1! > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- > 1 file changed, 20 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a4b5a013064b..63359ab0e798 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > int r; > > /* try to reinject previous events if any */ > - if (vcpu->arch.exception.injected) { > - kvm_x86_ops->queue_exception(vcpu); > - return 0; > - } > > + if (vcpu->arch.exception.injected) > + kvm_x86_ops->queue_exception(vcpu); > /* > * NMI/interrupt must not be injected if an exception is > * pending, because the latter may have been queued by > * handling exit due to NMI/interrupt delivery. 
> */ > - if (!vcpu->arch.exception.pending) { > - if (vcpu->arch.nmi_injected) { > + else if (!vcpu->arch.exception.pending) { > + if (vcpu->arch.nmi_injected) > kvm_x86_ops->set_nmi(vcpu); > - return 0; > - } > - > - if (vcpu->arch.interrupt.injected) { > + else if (vcpu->arch.interrupt.injected) > kvm_x86_ops->set_irq(vcpu); > - return 0; > - } > } > > + /* > + * Call check_nested_events() even if we reinjected a previous event > + * in order for caller to determine if it should require immediate-exit > + * from L2 to L1 due to pending L1 events which require exit > + * from L2 to L1. > + */ > if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { > r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); > if (r != 0) > @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > vcpu->arch.exception.has_error_code, > vcpu->arch.exception.error_code); > > + WARN_ON_ONCE(vcpu->arch.exception.injected); > vcpu->arch.exception.pending = false; > vcpu->arch.exception.injected = true; > > @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > } > > kvm_x86_ops->queue_exception(vcpu); > - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { > + } > + > + /* Don't consider new event if we re-injected an event */ > + if (kvm_event_needs_reinjection(vcpu)) > + return 0; > + > + if (vcpu->arch.smi_pending && !is_smm(vcpu) && > + kvm_x86_ops->smi_allowed(vcpu)) { > vcpu->arch.smi_pending = false; > enter_smm(vcpu); > } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { > -- > 1.9.1 >
On 27/11/17 22:48, Jim Mattson wrote: > I am concerned about the possible conflation of L1 and L2 events. > Vmx_check_nested_events() reads as though vcpu->arch.exception.pending > is always associated with an L1 event, yet inject_pending_event() > reads as though vcpu->arch.exception.injected is associated with the > current level. That seems odd, to split this structure that way. I think you misunderstand code (rightfully as it is confusing...). vmx_check_nested_events() treats exception.pending as related to current-level (L1 or L2). Specifically, because vmx_check_nested_events() is always called when is_guest_mode(vcpu)==true, then it treats it as a "pending L2 exception". This is also why it checks if L1 intercept the "pending L2 exception" and if yes, perform an exit from L2 to L1. I understand the structures in the following way: 1. "injected" status is always associated with the current-level. (Same for interrupt.pending because it is actually interrupt.injected, see comment below marked with (*)) 2. exception.pending is associated with current-level. 3. nmi_pending is associated with L1. 4. In general, all the exception/nmi/interrupt structures are "supposed" to follow the same convention: (a) A "pending" status means that the event should be injected to guest but it's side-effects have not occurred yet (not yet injected to guest). (b) An "injected" status means that the event was already "processed" and therefore injected to guest along with it's side-effects. (*) Currently, interrupt.pending is a confusing wrong name. It actually represents "interrupt.injected". This is why I made the a patch which renames this variable: ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected") https://marc.info/?l=kvm&m=151127826204179&w=2 (BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is because it's "pending" status can be considered to be held in LAPIC IRR. This is why for example vmx_check_nested_events() call kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt that may cause exit from L2 to L1). > > Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set > vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual > interrupt is interrupted by another event. But > vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an > L1 "physical" interrupt should cause a VM-exit from L2 to L1, and > kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when > !lapic_in_kernel(). Clearly, there is some disagreement here over what > vcpu->arch.interrupt.pending means. Again, there is some confusion here. __vmx_complete_interrupts() only re-queues an "injected" event. (Either due to exit during event-delivery and therefore valid IDT-vectoring-info or because of injection-cancellation in vcpu_enter_guest()). Therefore, this function should only re-insert events in "injected" state. As it really does: 1. NMI is re-injected with nmi_injected status. 2. exception is injected using kvm_requeue_exception() which will mark exception.injected=true. 3. interrupt is injected using kvm_queue_interrupt() which indeed mark interrupt.pending. But as stated above, this is just a wrong name and it is actually "interrupt.injected". > > I think there may be some more fundamental problems lurking here. Yep. You are indeed right. We have found several more issues revolving treatments of pending events in regard to nested-virtualization. 
I am about to post another patch series which handles some such cases which relates to nested-posted-interrupts handling. Stay tuned. Your review will be extremely valuable :) Regards, -Liran > > On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >> reinject before next resume of L2. >> >> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >> to the L1 vCPU which currently runs L2 and exited to L0. >> >> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >> it will note that a previous event was re-injected to L2 (by >> IDT-vectoring-info) and therefore won't check if there are pending L1 >> events which require exit from L2 to L1. Thus, L0 enters L2 without >> immediate VMExit even though there are pending L1 events! >> >> This commit fixes the issue by making sure to check for L1 pending >> events even if a previous event was reinjected to L2 and bailing out >> from inject_pending_event() before evaluating a new pending event in >> case an event was already reinjected. >> >> The bug was observed by the following setup: >> * L0 is a 64CPU machine which runs KVM. >> * L1 is a 16CPU machine which runs KVM. >> * L0 & L1 runs with APICv disabled. >> (Also reproduced with APICv enabled but easier to analyze below info >> with APICv disabled) >> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >> During L2 boot, L1 hangs completely and analyzing the hang reveals that >> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >> that he has sent for another L1 vCPU. And all other L1 vCPUs are >> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >> forever (as L1 runs with kernel-preemption disabled). >> >> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >> series of events: >> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >> vec 252 (Fixed|edge) >> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >> rip 0xffffe00069202690 info 83 0 >> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >> ext_int: 0x00000000 ext_int_err: 0x00000000 >> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> >> Which can be analyzed as follows: >> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >> a pending-interrupt of 0xd2. >> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >> notes that interrupt 0xd2 was reinjected and therefore calls >> vmx_inject_irq() and returns. Without checking for pending L1 events! 
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >> before calling inject_pending_event(). >> (4) L0 resumes L2 without immediate-exit even though there is a pending >> L1 event (The IPI pending in LAPIC's IRR). >> >> We have already reached the buggy scenario but events could be >> furthered analyzed: >> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >> event-delivery. >> (7) L0 decides to forward the VMExit to L1 for further handling. >> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >> LAPIC's IRR is not examined and therefore the IPI is still not delivered >> into L1! >> >> Signed-off-by: Liran Alon <liran.alon@oracle.com> >> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> --- >> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >> 1 file changed, 20 insertions(+), 13 deletions(-) >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index a4b5a013064b..63359ab0e798 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> int r; >> >> /* try to reinject previous events if any */ >> - if (vcpu->arch.exception.injected) { >> - kvm_x86_ops->queue_exception(vcpu); >> - return 0; >> - } >> >> + if (vcpu->arch.exception.injected) >> + kvm_x86_ops->queue_exception(vcpu); >> /* >> * NMI/interrupt must not be injected if an exception is >> * pending, because the latter may have been queued by >> * handling exit due to NMI/interrupt delivery. >> */ >> - if (!vcpu->arch.exception.pending) { >> - if (vcpu->arch.nmi_injected) { >> + else if (!vcpu->arch.exception.pending) { >> + if (vcpu->arch.nmi_injected) >> kvm_x86_ops->set_nmi(vcpu); >> - return 0; >> - } >> - >> - if (vcpu->arch.interrupt.injected) { >> + else if (vcpu->arch.interrupt.injected) >> kvm_x86_ops->set_irq(vcpu); >> - return 0; >> - } >> } >> >> + /* >> + * Call check_nested_events() even if we reinjected a previous event >> + * in order for caller to determine if it should require immediate-exit >> + * from L2 to L1 due to pending L1 events which require exit >> + * from L2 to L1. >> + */ >> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >> if (r != 0) >> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> vcpu->arch.exception.has_error_code, >> vcpu->arch.exception.error_code); >> >> + WARN_ON_ONCE(vcpu->arch.exception.injected); >> vcpu->arch.exception.pending = false; >> vcpu->arch.exception.injected = true; >> >> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> } >> >> kvm_x86_ops->queue_exception(vcpu); >> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { >> + } >> + >> + /* Don't consider new event if we re-injected an event */ >> + if (kvm_event_needs_reinjection(vcpu)) >> + return 0; >> + >> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >> + kvm_x86_ops->smi_allowed(vcpu)) { >> vcpu->arch.smi_pending = false; >> enter_smm(vcpu); >> } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { >> -- >> 1.9.1 >>
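As a summary of the interpretation Liran lays out above, the flags can be sketched schematically. The field names follow arch/x86 KVM, but this is not the actual kvm_vcpu_arch definition and the comments paraphrase the explanation above rather than any in-tree documentation.

```c
/*
 * Schematic sketch of the event-state flags as interpreted above; not the
 * real kvm_vcpu_arch layout, and the comments are paraphrase, not in-tree
 * documentation.
 */
struct vcpu_event_state_sketch {
	struct {
		bool pending;		/* queued for the current level (L1 or
					 * L2), side effects not yet applied */
		bool injected;		/* already delivered to the current
					 * level, side effects applied */
	} exception;

	struct {
		bool pending;		/* misnamed: really "injected"; the true
					 * pending state lives in the LAPIC IRR,
					 * which is why vmx_check_nested_events()
					 * calls kvm_cpu_has_interrupt() */
	} interrupt;

	unsigned int nmi_pending;	/* NMIs queued for L1 */
	bool nmi_injected;		/* NMI injected at the current level */
};
```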
On Mon, Nov 27, 2017 at 2:42 PM, Liran Alon <LIRAN.ALON@oracle.com> wrote: > > > On 27/11/17 22:48, Jim Mattson wrote: >> >> I am concerned about the possible conflation of L1 and L2 events. >> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending >> is always associated with an L1 event, yet inject_pending_event() >> reads as though vcpu->arch.exception.injected is associated with the >> current level. That seems odd, to split this structure that way. > > > I think you misunderstand code (rightfully as it is confusing...). > vmx_check_nested_events() treats exception.pending as related to > current-level (L1 or L2). Specifically, because vmx_check_nested_events() is > always called when is_guest_mode(vcpu)==true, then it treats it as a > "pending L2 exception". This is also why it checks if L1 intercept the > "pending L2 exception" and if yes, perform an exit from L2 to L1. > > I understand the structures in the following way: > 1. "injected" status is always associated with the current-level. > (Same for interrupt.pending because it is actually interrupt.injected, see > comment below marked with (*)) > 2. exception.pending is associated with current-level. > 3. nmi_pending is associated with L1. > 4. In general, all the exception/nmi/interrupt structures are "supposed" to > follow the same convention: > (a) A "pending" status means that the event should be injected to guest but > it's side-effects have not occurred yet (not yet injected to guest). > (b) An "injected" status means that the event was already "processed" and > therefore injected to guest along with it's side-effects. > > (*) Currently, interrupt.pending is a confusing wrong name. It actually > represents "interrupt.injected". This is why I made the a patch which > renames this variable: > ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected") > https://marc.info/?l=kvm&m=151127826204179&w=2 > (BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is > because it's "pending" status can be considered to be held in LAPIC IRR. > This is why for example vmx_check_nested_events() call > kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt > that may cause exit from L2 to L1). > >> >> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set >> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual >> interrupt is interrupted by another event. But >> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an >> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and >> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when >> !lapic_in_kernel(). Clearly, there is some disagreement here over what >> vcpu->arch.interrupt.pending means. > > > Again, there is some confusion here. > > __vmx_complete_interrupts() only re-queues an "injected" event. (Either due > to exit during event-delivery and therefore valid IDT-vectoring-info or > because of injection-cancellation in vcpu_enter_guest()). > Therefore, this function should only re-insert events in "injected" state. > As it really does: > 1. NMI is re-injected with nmi_injected status. > 2. exception is injected using kvm_requeue_exception() which will mark > exception.injected=true. > 3. interrupt is injected using kvm_queue_interrupt() which indeed mark > interrupt.pending. But as stated above, this is just a wrong name and it is > actually "interrupt.injected". 
kvm_queue_interrupt() begins as follows: vcpu->arch.interrupt.pending = true; kvm_cpu_has_interrupt() begins as follows: if (!lapic_in_kernel(v)) return v->arch.interrupt.pending; In the referenced [patch 2/8], you change interrupt.pending to interrupt.injected, but the same field is still referenced by these two functions. It's clear that __vmx_complete_interrupts sets this field to indicate an interrupted L2 virtual interrupt, and I agree with your assertion above that kvm_cpu_has_interrupt() *should* determine if there is a "pending" L1 physical interrupt that may cause exit from L2 to L1, but it seems to me that the injected L2 event and the pending L1 event have been conflated in this field, at least when lapic_in_kernel() is false. >> >> I think there may be some more fundamental problems lurking here. > > Yep. You are indeed right. > We have found several more issues revolving treatments of pending events in > regard to nested-virtualization. > I am about to post another patch series which handles some such cases which > relates to nested-posted-interrupts handling. Stay tuned. Your review will > be extremely valuable :) > > Regards, > -Liran > > >> >> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >>> >>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >>> reinject before next resume of L2. >>> >>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >>> to the L1 vCPU which currently runs L2 and exited to L0. >>> >>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >>> it will note that a previous event was re-injected to L2 (by >>> IDT-vectoring-info) and therefore won't check if there are pending L1 >>> events which require exit from L2 to L1. Thus, L0 enters L2 without >>> immediate VMExit even though there are pending L1 events! >>> >>> This commit fixes the issue by making sure to check for L1 pending >>> events even if a previous event was reinjected to L2 and bailing out >>> from inject_pending_event() before evaluating a new pending event in >>> case an event was already reinjected. >>> >>> The bug was observed by the following setup: >>> * L0 is a 64CPU machine which runs KVM. >>> * L1 is a 16CPU machine which runs KVM. >>> * L0 & L1 runs with APICv disabled. >>> (Also reproduced with APICv enabled but easier to analyze below info >>> with APICv disabled) >>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >>> During L2 boot, L1 hangs completely and analyzing the hang reveals that >>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >>> that he has sent for another L1 vCPU. And all other L1 vCPUs are >>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >>> forever (as L1 runs with kernel-preemption disabled). 
>>> >>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >>> series of events: >>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >>> vec 252 (Fixed|edge) >>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >>> rip 0xffffe00069202690 info 83 0 >>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >>> ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> >>> Which can be analyzed as follows: >>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >>> a pending-interrupt of 0xd2. >>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >>> notes that interrupt 0xd2 was reinjected and therefore calls >>> vmx_inject_irq() and returns. Without checking for pending L1 events! >>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >>> before calling inject_pending_event(). >>> (4) L0 resumes L2 without immediate-exit even though there is a pending >>> L1 event (The IPI pending in LAPIC's IRR). >>> >>> We have already reached the buggy scenario but events could be >>> furthered analyzed: >>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >>> event-delivery. >>> (7) L0 decides to forward the VMExit to L1 for further handling. >>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >>> LAPIC's IRR is not examined and therefore the IPI is still not delivered >>> into L1! >>> >>> Signed-off-by: Liran Alon <liran.alon@oracle.com> >>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> --- >>> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >>> 1 file changed, 20 insertions(+), 13 deletions(-) >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index a4b5a013064b..63359ab0e798 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu >>> *vcpu, bool req_int_win) >>> int r; >>> >>> /* try to reinject previous events if any */ >>> - if (vcpu->arch.exception.injected) { >>> - kvm_x86_ops->queue_exception(vcpu); >>> - return 0; >>> - } >>> >>> + if (vcpu->arch.exception.injected) >>> + kvm_x86_ops->queue_exception(vcpu); >>> /* >>> * NMI/interrupt must not be injected if an exception is >>> * pending, because the latter may have been queued by >>> * handling exit due to NMI/interrupt delivery. 
>>> */ >>> - if (!vcpu->arch.exception.pending) { >>> - if (vcpu->arch.nmi_injected) { >>> + else if (!vcpu->arch.exception.pending) { >>> + if (vcpu->arch.nmi_injected) >>> kvm_x86_ops->set_nmi(vcpu); >>> - return 0; >>> - } >>> - >>> - if (vcpu->arch.interrupt.injected) { >>> + else if (vcpu->arch.interrupt.injected) >>> kvm_x86_ops->set_irq(vcpu); >>> - return 0; >>> - } >>> } >>> >>> + /* >>> + * Call check_nested_events() even if we reinjected a previous >>> event >>> + * in order for caller to determine if it should require >>> immediate-exit >>> + * from L2 to L1 due to pending L1 events which require exit >>> + * from L2 to L1. >>> + */ >>> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >>> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >>> if (r != 0) >>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu >>> *vcpu, bool req_int_win) >>> >>> vcpu->arch.exception.has_error_code, >>> >>> vcpu->arch.exception.error_code); >>> >>> + WARN_ON_ONCE(vcpu->arch.exception.injected); >>> vcpu->arch.exception.pending = false; >>> vcpu->arch.exception.injected = true; >>> >>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu >>> *vcpu, bool req_int_win) >>> } >>> >>> kvm_x86_ops->queue_exception(vcpu); >>> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>> kvm_x86_ops->smi_allowed(vcpu)) { >>> + } >>> + >>> + /* Don't consider new event if we re-injected an event */ >>> + if (kvm_event_needs_reinjection(vcpu)) >>> + return 0; >>> + >>> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>> + kvm_x86_ops->smi_allowed(vcpu)) { >>> vcpu->arch.smi_pending = false; >>> enter_smm(vcpu); >>> } else if (vcpu->arch.nmi_pending && >>> kvm_x86_ops->nmi_allowed(vcpu)) { >>> -- >>> 1.9.1 >>> >
On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: > In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with > IDT-vectoring-info which vmx_complete_interrupts() makes sure to > reinject before next resume of L2. > > While handling the VMExit in L0, an IPI could be sent by another L1 vCPU > to the L1 vCPU which currently runs L2 and exited to L0. > > When L0 will reach vcpu_enter_guest() and call inject_pending_event(), > it will note that a previous event was re-injected to L2 (by > IDT-vectoring-info) and therefore won't check if there are pending L1 > events which require exit from L2 to L1. Thus, L0 enters L2 without > immediate VMExit even though there are pending L1 events! > > This commit fixes the issue by making sure to check for L1 pending > events even if a previous event was reinjected to L2 and bailing out > from inject_pending_event() before evaluating a new pending event in > case an event was already reinjected. > > The bug was observed by the following setup: > * L0 is a 64CPU machine which runs KVM. > * L1 is a 16CPU machine which runs KVM. > * L0 & L1 runs with APICv disabled. > (Also reproduced with APICv enabled but easier to analyze below info > with APICv disabled) > * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. > During L2 boot, L1 hangs completely and analyzing the hang reveals that > one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI > that he has sent for another L1 vCPU. And all other L1 vCPUs are > currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck > forever (as L1 runs with kernel-preemption disabled). > > Observing /sys/kernel/debug/tracing/trace_pipe reveals the following > series of events: > (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: > 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 > ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 This looks like an EPT violation for writing a stack page that wasn't present in the vmcs02 shadow EPT tables. I assume that L0 filled in the missing shadow EPT table entry (or entries) and dismissed this EPT violation itself rather than forwarding it to L1, because L1's EPT tables have a valid entry for this L2 guest-physical address. > (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f > vec 252 (Fixed|edge) This is the source of the bug, right here. An interrupt can only be accepted at an instruction boundary, and the virtual CPU is in the midst of IDT-vectoring, which is not at an instruction boundary. The EPT violation from L2 to L0 is an L0 artifact that should be completely invisible to L1/L2. Interrupt 252 should remain pending until we are at the first instruction of the IDT handler for interrupt 210. To be more precise, if interrupt 210 was delivered via VM-entry event injection on L1 VM-entry to L2, then we need to: a) complete the event injection (i.e. vector through the L2 IDT for vector 210), b) handle virtual VMX-preemption timer expiration during VM-entry (if any), c) handle NMI-window exiting events (if any), d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the L2 IDT, depending on the NMI-exiting control), e) handle interrupt-window exiting events, f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by vectoring through the L2 IDT, depending on the external-iinterrupt exiting control). ... Only when we get to step (f) can we accept a new IRQ from L0's local APIC. 
> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 > (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 > (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION > rip 0xffffe00069202690 info 83 0 > (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: > 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 > ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 > (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: > EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 > ext_int: 0x00000000 ext_int_err: 0x00000000 > (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 > > Which can be analyzed as follows: > (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. > Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject > a pending-interrupt of 0xd2. > (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination > vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. > (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which > notes that interrupt 0xd2 was reinjected and therefore calls > vmx_inject_irq() and returns. Without checking for pending L1 events! > Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() > before calling inject_pending_event(). > (4) L0 resumes L2 without immediate-exit even though there is a pending > L1 event (The IPI pending in LAPIC's IRR). > > We have already reached the buggy scenario but events could be > furthered analyzed: > (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during > event-delivery. > (7) L0 decides to forward the VMExit to L1 for further handling. > (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the > LAPIC's IRR is not examined and therefore the IPI is still not delivered > into L1! > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- > 1 file changed, 20 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a4b5a013064b..63359ab0e798 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > int r; > > /* try to reinject previous events if any */ > - if (vcpu->arch.exception.injected) { > - kvm_x86_ops->queue_exception(vcpu); > - return 0; > - } > > + if (vcpu->arch.exception.injected) > + kvm_x86_ops->queue_exception(vcpu); > /* > * NMI/interrupt must not be injected if an exception is > * pending, because the latter may have been queued by > * handling exit due to NMI/interrupt delivery. > */ > - if (!vcpu->arch.exception.pending) { > - if (vcpu->arch.nmi_injected) { > + else if (!vcpu->arch.exception.pending) { > + if (vcpu->arch.nmi_injected) > kvm_x86_ops->set_nmi(vcpu); > - return 0; > - } > - > - if (vcpu->arch.interrupt.injected) { > + else if (vcpu->arch.interrupt.injected) > kvm_x86_ops->set_irq(vcpu); > - return 0; > - } > } > > + /* > + * Call check_nested_events() even if we reinjected a previous event > + * in order for caller to determine if it should require immediate-exit > + * from L2 to L1 due to pending L1 events which require exit > + * from L2 to L1. 
> + */ > if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { > r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); > if (r != 0) > @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > vcpu->arch.exception.has_error_code, > vcpu->arch.exception.error_code); > > + WARN_ON_ONCE(vcpu->arch.exception.injected); > vcpu->arch.exception.pending = false; > vcpu->arch.exception.injected = true; > > @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) > } > > kvm_x86_ops->queue_exception(vcpu); > - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { > + } > + > + /* Don't consider new event if we re-injected an event */ > + if (kvm_event_needs_reinjection(vcpu)) > + return 0; > + > + if (vcpu->arch.smi_pending && !is_smm(vcpu) && > + kvm_x86_ops->smi_allowed(vcpu)) { > vcpu->arch.smi_pending = false; > enter_smm(vcpu); > } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { > -- > 1.9.1 >
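Expressed as code, the ordering in steps (a) through (f) above might look like the sketch below; every helper name is hypothetical shorthand for the corresponding step, and none of them exist in KVM.

```c
/*
 * Hypothetical pseudocode of steps (a)-(f) above. None of these helpers
 * exist in KVM; they only label the constraint that a new IRQ from L0's
 * local APIC may be taken only at an L2 instruction boundary, after the
 * in-flight re-injection and the higher-priority events.
 */
static bool can_accept_new_l0_irq(struct kvm_vcpu *vcpu)
{
	if (!event_reinjection_complete(vcpu))		/* (a) finish vectoring 210 */
		return false;
	if (virtual_preemption_timer_expired(vcpu))	/* (b) */
		return false;
	if (nmi_window_exit_pending(vcpu))		/* (c) */
		return false;
	if (l0_nmi_pending(vcpu))			/* (d) L2->L1 exit or L2 IDT */
		return false;
	if (interrupt_window_exit_pending(vcpu))	/* (e) */
		return false;
	return true;					/* (f) LAPIC IRQ may be taken */
}
```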
On 28/11/2017 05:55, Jim Mattson wrote: > kvm_queue_interrupt() begins as follows: > vcpu->arch.interrupt.pending = true; > > kvm_cpu_has_interrupt() begins as follows: > if (!lapic_in_kernel(v)) > return v->arch.interrupt.pending; > > In the referenced [patch 2/8], you change interrupt.pending to > interrupt.injected, but the same field is still referenced by these > two functions. We cannot remove the !lapic_in_kernel(v) case, but it's okay if we restrict nested VMX/SVM in CPUID when it is disabled (that is, check for !lapic_in_kernel in nested_svm_check_permissions and nested_vmx_allowed, so that setting VMXE and SVME will fail). Thanks, Paolo
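A minimal sketch of what Paolo suggests, assuming the check lands in nested_vmx_allowed() (with the analogous check added to nested_svm_check_permissions() on the SVM side); the return expression is paraphrased from that era's vmx.c and the hunk is illustrative, not a merged patch.

```c
/*
 * Illustrative sketch of Paolo's suggestion, not a merged patch: refuse to
 * expose nested VMX when the LAPIC is emulated in userspace, so the
 * ambiguous !lapic_in_kernel() reading of interrupt.pending never applies
 * to nested scenarios. The same check would go into
 * nested_svm_check_permissions() on the SVM side.
 */
static bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
{
	if (!lapic_in_kernel(vcpu))
		return false;

	return nested && guest_cpuid_has(vcpu, X86_FEATURE_VMX);
}
```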
On 28/11/17 06:55, Jim Mattson wrote: > On Mon, Nov 27, 2017 at 2:42 PM, Liran Alon <LIRAN.ALON@oracle.com> wrote: >> >> >> On 27/11/17 22:48, Jim Mattson wrote: >>> >>> I am concerned about the possible conflation of L1 and L2 events. >>> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending >>> is always associated with an L1 event, yet inject_pending_event() >>> reads as though vcpu->arch.exception.injected is associated with the >>> current level. That seems odd, to split this structure that way. >> >> >> I think you misunderstand code (rightfully as it is confusing...). >> vmx_check_nested_events() treats exception.pending as related to >> current-level (L1 or L2). Specifically, because vmx_check_nested_events() is >> always called when is_guest_mode(vcpu)==true, then it treats it as a >> "pending L2 exception". This is also why it checks if L1 intercept the >> "pending L2 exception" and if yes, perform an exit from L2 to L1. >> >> I understand the structures in the following way: >> 1. "injected" status is always associated with the current-level. >> (Same for interrupt.pending because it is actually interrupt.injected, see >> comment below marked with (*)) >> 2. exception.pending is associated with current-level. >> 3. nmi_pending is associated with L1. >> 4. In general, all the exception/nmi/interrupt structures are "supposed" to >> follow the same convention: >> (a) A "pending" status means that the event should be injected to guest but >> it's side-effects have not occurred yet (not yet injected to guest). >> (b) An "injected" status means that the event was already "processed" and >> therefore injected to guest along with it's side-effects. >> >> (*) Currently, interrupt.pending is a confusing wrong name. It actually >> represents "interrupt.injected". This is why I made the a patch which >> renames this variable: >> ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected") >> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dkvm-26m-3D151127826204179-26w-3D2&d=DwIBaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jEXMkXjEtpUV8ujOR5HKAievAdmBLs-ernvMak55rWI&s=7LxsKUsoJs_ZxeyBEm9t8TtPML-Zhcim11Bdx6TKA60&e= >> (BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is >> because it's "pending" status can be considered to be held in LAPIC IRR. >> This is why for example vmx_check_nested_events() call >> kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt >> that may cause exit from L2 to L1). >> >>> >>> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set >>> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual >>> interrupt is interrupted by another event. But >>> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an >>> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and >>> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when >>> !lapic_in_kernel(). Clearly, there is some disagreement here over what >>> vcpu->arch.interrupt.pending means. >> >> >> Again, there is some confusion here. >> >> __vmx_complete_interrupts() only re-queues an "injected" event. (Either due >> to exit during event-delivery and therefore valid IDT-vectoring-info or >> because of injection-cancellation in vcpu_enter_guest()). >> Therefore, this function should only re-insert events in "injected" state. >> As it really does: >> 1. NMI is re-injected with nmi_injected status. >> 2. 
exception is injected using kvm_requeue_exception() which will mark >> exception.injected=true. >> 3. interrupt is injected using kvm_queue_interrupt() which indeed mark >> interrupt.pending. But as stated above, this is just a wrong name and it is >> actually "interrupt.injected". > > kvm_queue_interrupt() begins as follows: > vcpu->arch.interrupt.pending = true; > > kvm_cpu_has_interrupt() begins as follows: > if (!lapic_in_kernel(v)) > return v->arch.interrupt.pending; > > In the referenced [patch 2/8], you change interrupt.pending to > interrupt.injected, but the same field is still referenced by these > two functions. > It's clear that __vmx_complete_interrupts sets this field to indicate > an interrupted L2 virtual interrupt, and I agree with your assertion > above that kvm_cpu_has_interrupt() *should* determine if there is a > "pending" L1 physical interrupt that may cause exit from L2 to L1, but > it seems to me that the injected L2 event and the pending L1 event > have been conflated in this field, at least when lapic_in_kernel() is > false. First of all, I hope we agree that in case lapic_in_kernel()==true then interrupt.pending actually represents interrupt.injected. In regards to lapic_in_kernel()==false: "interrupt.pending" can currently only be set by user-mode in the following ways: 1. KVM_INTERRUPT ioctl. 2. KVM_SET_VCPU_EVENTS ioctl. 3. KVM_SET_SREGS ioctl. Looking at how QEMU use these ioctls, one can see that they use it either to re-set an "interrupt.injected" state it has received from KVM (via KVM_GET_VCPU_EVENTS interrupt.injected or via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR before sending ioctl to KVM. So it is true that "interrupt.pending" in KVM always represents "interrupt.injected". One could argue however that kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr() specifically is misusing interrupt.injected in order to return if there is a pending-interrupt in case lapic_in_kernel()==false. I agree with this. However, all other places in KVM code, as far as I can tell, don't misuse this flag and treat it as "interrupt.injected". Because the issue with these functions always existed and it is unrelated to the fixes introduced by this series (and specially this nVMX fix patch), I suggest modifying my patch ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected") to also add a FIXME comment on top of these places. What do you think? -Liran > >>> >>> I think there may be some more fundamental problems lurking here. >> >> Yep. You are indeed right. >> We have found several more issues revolving treatments of pending events in >> regard to nested-virtualization. >> I am about to post another patch series which handles some such cases which >> relates to nested-posted-interrupts handling. Stay tuned. Your review will >> be extremely valuable :) >> >> Regards, >> -Liran >> >> >>> >>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >>>> >>>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >>>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >>>> reinject before next resume of L2. >>>> >>>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >>>> to the L1 vCPU which currently runs L2 and exited to L0. 
>>>> >>>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >>>> it will note that a previous event was re-injected to L2 (by >>>> IDT-vectoring-info) and therefore won't check if there are pending L1 >>>> events which require exit from L2 to L1. Thus, L0 enters L2 without >>>> immediate VMExit even though there are pending L1 events! >>>> >>>> This commit fixes the issue by making sure to check for L1 pending >>>> events even if a previous event was reinjected to L2 and bailing out >>>> from inject_pending_event() before evaluating a new pending event in >>>> case an event was already reinjected. >>>> >>>> The bug was observed by the following setup: >>>> * L0 is a 64CPU machine which runs KVM. >>>> * L1 is a 16CPU machine which runs KVM. >>>> * L0 & L1 runs with APICv disabled. >>>> (Also reproduced with APICv enabled but easier to analyze below info >>>> with APICv disabled) >>>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >>>> During L2 boot, L1 hangs completely and analyzing the hang reveals that >>>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >>>> that he has sent for another L1 vCPU. And all other L1 vCPUs are >>>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >>>> forever (as L1 runs with kernel-preemption disabled). >>>> >>>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >>>> series of events: >>>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >>>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >>>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >>>> vec 252 (Fixed|edge) >>>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >>>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >>>> rip 0xffffe00069202690 info 83 0 >>>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >>>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >>>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >>>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >>>> ext_int: 0x00000000 ext_int_err: 0x00000000 >>>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>>> >>>> Which can be analyzed as follows: >>>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >>>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >>>> a pending-interrupt of 0xd2. >>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >>>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >>>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >>>> notes that interrupt 0xd2 was reinjected and therefore calls >>>> vmx_inject_irq() and returns. Without checking for pending L1 events! >>>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >>>> before calling inject_pending_event(). >>>> (4) L0 resumes L2 without immediate-exit even though there is a pending >>>> L1 event (The IPI pending in LAPIC's IRR). >>>> >>>> We have already reached the buggy scenario but events could be >>>> furthered analyzed: >>>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >>>> event-delivery. >>>> (7) L0 decides to forward the VMExit to L1 for further handling. 
>>>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >>>> LAPIC's IRR is not examined and therefore the IPI is still not delivered >>>> into L1! >>>> >>>> Signed-off-by: Liran Alon <liran.alon@oracle.com> >>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>>> --- >>>> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >>>> 1 file changed, 20 insertions(+), 13 deletions(-) >>>> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>> index a4b5a013064b..63359ab0e798 100644 >>>> --- a/arch/x86/kvm/x86.c >>>> +++ b/arch/x86/kvm/x86.c >>>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> int r; >>>> >>>> /* try to reinject previous events if any */ >>>> - if (vcpu->arch.exception.injected) { >>>> - kvm_x86_ops->queue_exception(vcpu); >>>> - return 0; >>>> - } >>>> >>>> + if (vcpu->arch.exception.injected) >>>> + kvm_x86_ops->queue_exception(vcpu); >>>> /* >>>> * NMI/interrupt must not be injected if an exception is >>>> * pending, because the latter may have been queued by >>>> * handling exit due to NMI/interrupt delivery. >>>> */ >>>> - if (!vcpu->arch.exception.pending) { >>>> - if (vcpu->arch.nmi_injected) { >>>> + else if (!vcpu->arch.exception.pending) { >>>> + if (vcpu->arch.nmi_injected) >>>> kvm_x86_ops->set_nmi(vcpu); >>>> - return 0; >>>> - } >>>> - >>>> - if (vcpu->arch.interrupt.injected) { >>>> + else if (vcpu->arch.interrupt.injected) >>>> kvm_x86_ops->set_irq(vcpu); >>>> - return 0; >>>> - } >>>> } >>>> >>>> + /* >>>> + * Call check_nested_events() even if we reinjected a previous >>>> event >>>> + * in order for caller to determine if it should require >>>> immediate-exit >>>> + * from L2 to L1 due to pending L1 events which require exit >>>> + * from L2 to L1. >>>> + */ >>>> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >>>> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >>>> if (r != 0) >>>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> >>>> vcpu->arch.exception.has_error_code, >>>> >>>> vcpu->arch.exception.error_code); >>>> >>>> + WARN_ON_ONCE(vcpu->arch.exception.injected); >>>> vcpu->arch.exception.pending = false; >>>> vcpu->arch.exception.injected = true; >>>> >>>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> } >>>> >>>> kvm_x86_ops->queue_exception(vcpu); >>>> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>>> kvm_x86_ops->smi_allowed(vcpu)) { >>>> + } >>>> + >>>> + /* Don't consider new event if we re-injected an event */ >>>> + if (kvm_event_needs_reinjection(vcpu)) >>>> + return 0; >>>> + >>>> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>>> + kvm_x86_ops->smi_allowed(vcpu)) { >>>> vcpu->arch.smi_pending = false; >>>> enter_smm(vcpu); >>>> } else if (vcpu->arch.nmi_pending && >>>> kvm_x86_ops->nmi_allowed(vcpu)) { >>>> -- >>>> 1.9.1 >>>> >>
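To make the FIXME Liran proposes concrete, it might read roughly as below; the wording is illustrative (not taken from any posted patch), the field name assumes the rename from patch 2/8 has been applied, and the in-kernel-LAPIC branch is simplified.

```c
/*
 * Rough illustration of the proposed FIXME; wording is not from any posted
 * patch, the field name assumes the patch 2/8 rename, and the in-kernel
 * LAPIC branch below is simplified.
 */
int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
{
	/*
	 * FIXME: interrupt.injected means "an interrupt whose side effects
	 * were already applied", e.g. one re-queued from IDT-vectoring info
	 * after an exit during delivery. Returning it here as "the userspace
	 * LAPIC has a pending interrupt" conflates an injected (possibly L2)
	 * event with a pending L1 event; see the discussion in this thread.
	 */
	if (!lapic_in_kernel(v))
		return v->arch.interrupt.injected;

	return kvm_apic_has_interrupt(v) != -1;	/* simplified */
}
```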
On 28/11/17 13:14, Paolo Bonzini wrote: > On 28/11/2017 05:55, Jim Mattson wrote: >> kvm_queue_interrupt() begins as follows: >> vcpu->arch.interrupt.pending = true; >> >> kvm_cpu_has_interrupt() begins as follows: >> if (!lapic_in_kernel(v)) >> return v->arch.interrupt.pending; >> >> In the referenced [patch 2/8], you change interrupt.pending to >> interrupt.injected, but the same field is still referenced by these >> two functions. > > We cannot remove the !lapic_in_kernel(v) case, but it's okay if we > restrict nested VMX/SVM in CPUID when it is disabled (that is, check for > !lapic_in_kernel in nested_svm_check_permissions and nested_vmx_allowed, > so that setting VMXE and SVME will fail). I agree with this suggestion. I think it is best to add a commit before ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected") that adds FIXME comments to kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr(), explaining why their use of interrupt.pending is problematic in nVMX/nSVM scenarios, and that also adds the checks Paolo mentioned to nested_svm_check_permissions() and nested_vmx_allowed(). Regards, -Liran > > Thanks, > > Paolo >
One other thing to note is that there is no place to put the interrupted IDT vectoring information for L2 virtual interrupt 210 when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt 252. I see that vmcs12_save_pending_event() shoves this information into the VMCS12 IDT-vectoring information field, but it should not do so when the VMCS12 exit reason is "exception or non-maskable interrupt." From the SDM, volume 3, section 27.2.3: Section 24.9.3 defined fields containing information for VM exits that occur while delivering an event through the IDT and as a result of any of the following cases: o A fault occurs during event delivery and causes a VM exit (because the bit associated with the fault is set to 1 in the exception bitmap). o A task switch is invoked through a task gate in the IDT. The VM exit occurs due to the task switch only after the initial checks of the task switch pass (see Section 25.4.2). o Event delivery causes an APIC-access VM exit (see Section 29.4). o An EPT violation, EPT misconfiguration, or page-modification log-full event that occurs during event delivery. For all other exit reasons (including "exception or non-maskable interrupt"), the valid bit of the IDT-vectoring info field should be clear. On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote: > On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >> reinject before next resume of L2. >> >> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >> to the L1 vCPU which currently runs L2 and exited to L0. >> >> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >> it will note that a previous event was re-injected to L2 (by >> IDT-vectoring-info) and therefore won't check if there are pending L1 >> events which require exit from L2 to L1. Thus, L0 enters L2 without >> immediate VMExit even though there are pending L1 events! >> >> This commit fixes the issue by making sure to check for L1 pending >> events even if a previous event was reinjected to L2 and bailing out >> from inject_pending_event() before evaluating a new pending event in >> case an event was already reinjected. >> >> The bug was observed by the following setup: >> * L0 is a 64CPU machine which runs KVM. >> * L1 is a 16CPU machine which runs KVM. >> * L0 & L1 runs with APICv disabled. >> (Also reproduced with APICv enabled but easier to analyze below info >> with APICv disabled) >> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >> During L2 boot, L1 hangs completely and analyzing the hang reveals that >> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >> that he has sent for another L1 vCPU. And all other L1 vCPUs are >> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >> forever (as L1 runs with kernel-preemption disabled). >> >> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >> series of events: >> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 > > This looks like an EPT violation for writing a stack page that wasn't > present in the vmcs02 shadow EPT tables. 
I assume that L0 filled in > the missing shadow EPT table entry (or entries) and dismissed this EPT > violation itself rather than forwarding it to L1, because L1's EPT > tables have a valid entry for this L2 guest-physical address. > >> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >> vec 252 (Fixed|edge) > > This is the source of the bug, right here. An interrupt can only be > accepted at an instruction boundary, and the virtual CPU is in the > midst of IDT-vectoring, which is not at an instruction boundary. The > EPT violation from L2 to L0 is an L0 artifact that should be > completely invisible to L1/L2. Interrupt 252 should remain pending > until we are at the first instruction of the IDT handler for interrupt > 210. > > To be more precise, if interrupt 210 was delivered via VM-entry event > injection on L1 VM-entry to L2, then we need to: > a) complete the event injection (i.e. vector through the L2 IDT for vector 210), > b) handle virtual VMX-preemption timer expiration during VM-entry (if any), > c) handle NMI-window exiting events (if any), > d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the > L2 IDT, depending on the NMI-exiting control), > e) handle interrupt-window exiting events, > f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by > vectoring through the L2 IDT, depending on the external-iinterrupt > exiting control). > ... > > Only when we get to step (f) can we accept a new IRQ from L0's local APIC. > >> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >> rip 0xffffe00069202690 info 83 0 >> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >> ext_int: 0x00000000 ext_int_err: 0x00000000 >> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> >> Which can be analyzed as follows: >> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >> a pending-interrupt of 0xd2. >> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >> notes that interrupt 0xd2 was reinjected and therefore calls >> vmx_inject_irq() and returns. Without checking for pending L1 events! >> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >> before calling inject_pending_event(). >> (4) L0 resumes L2 without immediate-exit even though there is a pending >> L1 event (The IPI pending in LAPIC's IRR). >> >> We have already reached the buggy scenario but events could be >> furthered analyzed: >> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >> event-delivery. >> (7) L0 decides to forward the VMExit to L1 for further handling. >> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >> LAPIC's IRR is not examined and therefore the IPI is still not delivered >> into L1! 
>> >> Signed-off-by: Liran Alon <liran.alon@oracle.com> >> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> --- >> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >> 1 file changed, 20 insertions(+), 13 deletions(-) >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index a4b5a013064b..63359ab0e798 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> int r; >> >> /* try to reinject previous events if any */ >> - if (vcpu->arch.exception.injected) { >> - kvm_x86_ops->queue_exception(vcpu); >> - return 0; >> - } >> >> + if (vcpu->arch.exception.injected) >> + kvm_x86_ops->queue_exception(vcpu); >> /* >> * NMI/interrupt must not be injected if an exception is >> * pending, because the latter may have been queued by >> * handling exit due to NMI/interrupt delivery. >> */ >> - if (!vcpu->arch.exception.pending) { >> - if (vcpu->arch.nmi_injected) { >> + else if (!vcpu->arch.exception.pending) { >> + if (vcpu->arch.nmi_injected) >> kvm_x86_ops->set_nmi(vcpu); >> - return 0; >> - } >> - >> - if (vcpu->arch.interrupt.injected) { >> + else if (vcpu->arch.interrupt.injected) >> kvm_x86_ops->set_irq(vcpu); >> - return 0; >> - } >> } >> >> + /* >> + * Call check_nested_events() even if we reinjected a previous event >> + * in order for caller to determine if it should require immediate-exit >> + * from L2 to L1 due to pending L1 events which require exit >> + * from L2 to L1. >> + */ >> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >> if (r != 0) >> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> vcpu->arch.exception.has_error_code, >> vcpu->arch.exception.error_code); >> >> + WARN_ON_ONCE(vcpu->arch.exception.injected); >> vcpu->arch.exception.pending = false; >> vcpu->arch.exception.injected = true; >> >> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> } >> >> kvm_x86_ops->queue_exception(vcpu); >> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { >> + } >> + >> + /* Don't consider new event if we re-injected an event */ >> + if (kvm_event_needs_reinjection(vcpu)) >> + return 0; >> + >> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >> + kvm_x86_ops->smi_allowed(vcpu)) { >> vcpu->arch.smi_pending = false; >> enter_smm(vcpu); >> } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { >> -- >> 1.9.1 >>
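For illustration, here is a minimal sketch of the kind of reason-based filter Jim is describing for vmcs12_save_pending_event(). The helper name is hypothetical and this is not the actual KVM code; it only encodes the SDM 27.2.3 list quoted above:

static bool idt_vectoring_info_defined_for(u32 vmcs12_exit_reason)
{
	switch (vmcs12_exit_reason) {
	case EXIT_REASON_TASK_SWITCH:		/* task gate in the IDT */
	case EXIT_REASON_APIC_ACCESS:		/* APIC access during delivery */
	case EXIT_REASON_EPT_VIOLATION:
	case EXIT_REASON_EPT_MISCONFIG:
	case EXIT_REASON_PML_FULL:
		return true;
	case EXIT_REASON_EXCEPTION_NMI:
		/*
		 * Defined only when the exit was caused by a fault taken
		 * while delivering the original event (exception-bitmap
		 * case); the exit reason alone cannot distinguish that,
		 * so a real implementation needs extra context here.
		 */
		return false;
	default:
		/* All other reasons: leave the valid bit clear. */
		return false;
	}
}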
On 28/11/17 08:39, Jim Mattson wrote: > On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >> reinject before next resume of L2. >> >> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >> to the L1 vCPU which currently runs L2 and exited to L0. >> >> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >> it will note that a previous event was re-injected to L2 (by >> IDT-vectoring-info) and therefore won't check if there are pending L1 >> events which require exit from L2 to L1. Thus, L0 enters L2 without >> immediate VMExit even though there are pending L1 events! >> >> This commit fixes the issue by making sure to check for L1 pending >> events even if a previous event was reinjected to L2 and bailing out >> from inject_pending_event() before evaluating a new pending event in >> case an event was already reinjected. >> >> The bug was observed by the following setup: >> * L0 is a 64CPU machine which runs KVM. >> * L1 is a 16CPU machine which runs KVM. >> * L0 & L1 runs with APICv disabled. >> (Also reproduced with APICv enabled but easier to analyze below info >> with APICv disabled) >> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >> During L2 boot, L1 hangs completely and analyzing the hang reveals that >> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >> that he has sent for another L1 vCPU. And all other L1 vCPUs are >> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >> forever (as L1 runs with kernel-preemption disabled). >> >> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >> series of events: >> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 > > This looks like an EPT violation for writing a stack page that wasn't > present in the vmcs02 shadow EPT tables. I assume that L0 filled in > the missing shadow EPT table entry (or entries) and dismissed this EPT > violation itself rather than forwarding it to L1, because L1's EPT > tables have a valid entry for this L2 guest-physical address. Yes. This is how I see it as well. One can see in (4) that L0 resumed L2 without exiting to L1. > >> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >> vec 252 (Fixed|edge) > > This is the source of the bug, right here. An interrupt can only be > accepted at an instruction boundary, and the virtual CPU is in the > midst of IDT-vectoring, which is not at an instruction boundary. The > EPT violation from L2 to L0 is an L0 artifact that should be > completely invisible to L1/L2. Interrupt 252 should remain pending > until we are at the first instruction of the IDT handler for interrupt > 210. The trace kvm_apic_accept_irq is printed from __apic_accept_irq(). This function is called by IPI *sender* CPU after it has found the vCPU LAPIC to which an IPI should be delivered to. See how function is being called in the following code path: kvm_lapic_reg_write() APIC_ICR handler -> apic_send_ipi() -> kvm_irq_delivery_to_apic() -> kvm_apic_set_irq() -> __apic_accept_irq(). In addition, following __apic_accept_irq() APIC_DM_FIXED handler, one can see how the IPI is actually delivered to dest CPU. 
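For reference, that fixed-vector delivery path, condensed from __apic_accept_irq() with APICv disabled (a simplified sketch; the trace point, result bookkeeping and the APICv/posted-interrupt branch are omitted):

	case APIC_DM_FIXED:
		/* Record the vector as pending in the destination LAPIC's IRR. */
		kvm_lapic_set_irr(vector, apic);
		/* Ask the destination vCPU to re-evaluate pending events ... */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		/* ... and kick it out of the guest, or out of its blocked state. */
		kvm_vcpu_kick(vcpu);
		break;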
In this case APICv is disabled, so the IPI is delivered by setting the relevant bit in the destination LAPIC's IRR, setting KVM_REQ_EVENT and kicking the destination vCPU in case it is blocked or inside the guest. This causes the destination vCPU to eventually reach vcpu_enter_guest(), which consumes KVM_REQ_EVENT and calls inject_pending_event(). Only then should it be noticed that kvm_cpu_has_injectable_intr() returns true (as the bit is set in the LAPIC IRR), and only then is kvm_x86_ops->set_irq() called to actually inject the IRQ into the guest by writing it to the VMCS. Therefore, this is not the source of the bug in my opinion. The interrupt isn't really being accepted at an instruction boundary here; at this point it is only marked pending in the IRR... > > To be more precise, if interrupt 210 was delivered via VM-entry event > injection on L1 VM-entry to L2, then we need to: > a) complete the event injection (i.e. vector through the L2 IDT for vector 210), > b) handle virtual VMX-preemption timer expiration during VM-entry (if any), > c) handle NMI-window exiting events (if any), > d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the > L2 IDT, depending on the NMI-exiting control), > e) handle interrupt-window exiting events, > f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by > vectoring through the L2 IDT, depending on the external-iinterrupt > exiting control). > ... > > Only when we get to step (f) can we accept a new IRQ from L0's local APIC.

This is exactly what inject_pending_event() should be doing after this commit. It starts by checking if there is some event to "re-inject", which is basically step (a). Then it calls check_nested_events(), which basically handles steps (b)+(c)+(d)+(e). Then it evaluates whether there is a pending exception to inject (it could have been queued while handling an exit due to event-delivery) and, if so, injects it into the guest. Only then do we check if we have already injected some event into the guest. If yes, resume the guest with the injected event already written to the VMCS. Otherwise, continue evaluating more pending events such as pending L1 SMIs, NMIs or LAPIC interrupts and inject them if possible. (A condensed sketch of this ordering appears after this message.)

The problem this patch attempts to fix is that, before this patch, after performing step (a), if it indeed "re-injected" some event, we would not evaluate at all whether there is a pending event which should require us to exit from L2 to L1. In other words, what happened here is that completing the "re-injection" of the event which was delivered and caused the EPT_VIOLATION blocked us from evaluating the pending L1 IPI which was sent by another CPU, and therefore we missed an exit on EXTERNAL_INTERRUPT from L2 to L1.

Regards,
-Liran

> >> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >> rip 0xffffe00069202690 info 83 0 >> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >> ext_int: 0x00000000 ext_int_err: 0x00000000 >> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >> >> Which can be analyzed as follows: >> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >> a pending-interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >> notes that interrupt 0xd2 was reinjected and therefore calls >> vmx_inject_irq() and returns. Without checking for pending L1 events! >> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >> before calling inject_pending_event(). >> (4) L0 resumes L2 without immediate-exit even though there is a pending >> L1 event (The IPI pending in LAPIC's IRR). >> >> We have already reached the buggy scenario but events could be >> furthered analyzed: >> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >> event-delivery. >> (7) L0 decides to forward the VMExit to L1 for further handling. >> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >> LAPIC's IRR is not examined and therefore the IPI is still not delivered >> into L1! >> >> Signed-off-by: Liran Alon <liran.alon@oracle.com> >> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> --- >> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >> 1 file changed, 20 insertions(+), 13 deletions(-) >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index a4b5a013064b..63359ab0e798 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> int r; >> >> /* try to reinject previous events if any */ >> - if (vcpu->arch.exception.injected) { >> - kvm_x86_ops->queue_exception(vcpu); >> - return 0; >> - } >> >> + if (vcpu->arch.exception.injected) >> + kvm_x86_ops->queue_exception(vcpu); >> /* >> * NMI/interrupt must not be injected if an exception is >> * pending, because the latter may have been queued by >> * handling exit due to NMI/interrupt delivery. >> */ >> - if (!vcpu->arch.exception.pending) { >> - if (vcpu->arch.nmi_injected) { >> + else if (!vcpu->arch.exception.pending) { >> + if (vcpu->arch.nmi_injected) >> kvm_x86_ops->set_nmi(vcpu); >> - return 0; >> - } >> - >> - if (vcpu->arch.interrupt.injected) { >> + else if (vcpu->arch.interrupt.injected) >> kvm_x86_ops->set_irq(vcpu); >> - return 0; >> - } >> } >> >> + /* >> + * Call check_nested_events() even if we reinjected a previous event >> + * in order for caller to determine if it should require immediate-exit >> + * from L2 to L1 due to pending L1 events which require exit >> + * from L2 to L1. 
>> + */ >> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >> if (r != 0) >> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> vcpu->arch.exception.has_error_code, >> vcpu->arch.exception.error_code); >> >> + WARN_ON_ONCE(vcpu->arch.exception.injected); >> vcpu->arch.exception.pending = false; >> vcpu->arch.exception.injected = true; >> >> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >> } >> >> kvm_x86_ops->queue_exception(vcpu); >> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { >> + } >> + >> + /* Don't consider new event if we re-injected an event */ >> + if (kvm_event_needs_reinjection(vcpu)) >> + return 0; >> + >> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >> + kvm_x86_ops->smi_allowed(vcpu)) { >> vcpu->arch.smi_pending = false; >> enter_smm(vcpu); >> } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { >> -- >> 1.9.1 >>
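To tie this ordering back to the code, here is a condensed sketch of inject_pending_event() as it looks after this patch (simplified; the exception-injection details, SMM/NMI/IRQ handling and error paths are abridged into comments):

static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
{
	/* Step (a): re-inject an event interrupted by the previous VM-exit. */
	if (vcpu->arch.exception.injected)
		kvm_x86_ops->queue_exception(vcpu);
	else if (!vcpu->arch.exception.pending) {
		if (vcpu->arch.nmi_injected)
			kvm_x86_ops->set_nmi(vcpu);
		else if (vcpu->arch.interrupt.injected)
			kvm_x86_ops->set_irq(vcpu);
	}

	/*
	 * Steps (b)-(e): even if we re-injected above, let the nested code
	 * decide whether pending L1 events require an exit from L2 to L1.
	 * A nonzero return (e.g. -EBUSY) makes the caller request an
	 * immediate exit once the re-injected event has been delivered.
	 */
	if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
		int r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
		if (r != 0)
			return r;
	}

	/* Inject an exception queued while handling the previous exit (abridged). */

	/* If an event was re-injected in step (a), don't consider new events. */
	if (kvm_event_needs_reinjection(vcpu))
		return 0;

	/* Step (f) and later: new SMIs, NMIs and maskable interrupts (abridged). */
	return 0;
}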
On 28/11/17 20:26, Jim Mattson wrote: > One other thing to note is that there is no place to put the > interrupted IDT vectoring information for L2 virtual interrupt 210 > when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt > 252. I see that vmcs12_save_pending_event() shoves this information > into the VMCS12 IDT-vectoring information field, but it should not do > so when the VMCS12 exit reason is "exception or non-maskable > interrupt." > > From the SDM, volume 3, section 27.2.3: > > Section 24.9.3 defined fields containing information for VM exits that > occur while delivering an event through the IDT and as a result of any > of the following cases: > o A fault occurs during event delivery and causes a VM exit (because > the bit associated with the fault is set to 1 in the exception > bitmap). > o A task switch is invoked through a task gate in the IDT. The VM > exit occurs due to the task switch only after the initial checks of > the task switch pass (see Section 25.4.2). > o Event delivery causes an APIC-access VM exit (see Section 29.4). > o An EPT violation, EPT misconfiguration, or page-modification > log-full event that occurs during event delivery. > > For all other exit reasons (including "exception or non-maskable > interrupt"), the valid bit of the IDT-vectoring info field should be > clear.

Interesting.

In the case described in the commit message of this patch, this is not relevant: because kvm_event_needs_reinjection()==true, vmx_check_nested_events() won't actually perform an exit from L2 to L1 on EXTERNAL_INTERRUPT in this case. Instead it returns -EBUSY, which causes vcpu_enter_guest() to request an "immediate exit" (via self-IPI; see the code in vcpu_enter_guest() around the req_immediate_exit variable, sketched after this message, for more info). This makes the guest complete the event injection of interrupt 210 but exit to L0 immediately afterwards, and L0 will then exit from L2 to L1 on EXTERNAL_INTERRUPT because of L1 interrupt 252.

However, you made me notice that there is another issue here in the vmx_check_nested_events() handling. If the event delivery had caused a fault that needs to cause an exit from L2 to L1, then currently vmx_check_nested_events() will see that kvm_event_needs_reinjection()==true and therefore not exit from L2 to L1 even though it should. So I think this is just another bug to fix...

Nice catch.
-Liran

> > On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote: >> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> >> wrote: >>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >>> reinject before next resume of L2. >>> >>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >>> to the L1 vCPU which currently runs L2 and exited to L0. >>> >>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >>> it will note that a previous event was re-injected to L2 (by >>> IDT-vectoring-info) and therefore won't check if there are pending L1 >>> events which require exit from L2 to L1. Thus, L0 enters L2 without >>> immediate VMExit even though there are pending L1 events! >>> >>> This commit fixes the issue by making sure to check for L1 pending >>> events even if a previous event was reinjected to L2 and bailing out >>> from inject_pending_event() before evaluating a new pending event in >>> case an event was already reinjected.
>>> >>> The bug was observed by the following setup: >>> * L0 is a 64CPU machine which runs KVM. >>> * L1 is a 16CPU machine which runs KVM. >>> * L0 & L1 runs with APICv disabled. >>> (Also reproduced with APICv enabled but easier to analyze below info >>> with APICv disabled) >>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >>> During L2 boot, L1 hangs completely and analyzing the hang reveals that >>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >>> that he has sent for another L1 vCPU. And all other L1 vCPUs are >>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >>> forever (as L1 runs with kernel-preemption disabled). >>> >>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >>> series of events: >>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >> >> This looks like an EPT violation for writing a stack page that wasn't >> present in the vmcs02 shadow EPT tables. I assume that L0 filled in >> the missing shadow EPT table entry (or entries) and dismissed this EPT >> violation itself rather than forwarding it to L1, because L1's EPT >> tables have a valid entry for this L2 guest-physical address. >> >>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >>> vec 252 (Fixed|edge) >> >> This is the source of the bug, right here. An interrupt can only be >> accepted at an instruction boundary, and the virtual CPU is in the >> midst of IDT-vectoring, which is not at an instruction boundary. The >> EPT violation from L2 to L0 is an L0 artifact that should be >> completely invisible to L1/L2. Interrupt 252 should remain pending >> until we are at the first instruction of the IDT handler for interrupt >> 210. >> >> To be more precise, if interrupt 210 was delivered via VM-entry event >> injection on L1 VM-entry to L2, then we need to: >> a) complete the event injection (i.e. vector through the L2 IDT for vector 210), >> b) handle virtual VMX-preemption timer expiration during VM-entry (if any), >> c) handle NMI-window exiting events (if any), >> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the >> L2 IDT, depending on the NMI-exiting control), >> e) handle interrupt-window exiting events, >> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by >> vectoring through the L2 IDT, depending on the external-iinterrupt >> exiting control). >> ... >> >> Only when we get to step (f) can we accept a new IRQ from L0's local APIC. >> >>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >>> rip 0xffffe00069202690 info 83 0 >>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >>> ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> >>> Which can be analyzed as follows: >>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >>> a pending-interrupt of 0xd2. 
>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >>> notes that interrupt 0xd2 was reinjected and therefore calls >>> vmx_inject_irq() and returns. Without checking for pending L1 events! >>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >>> before calling inject_pending_event(). >>> (4) L0 resumes L2 without immediate-exit even though there is a pending >>> L1 event (The IPI pending in LAPIC's IRR). >>> >>> We have already reached the buggy scenario but events could be >>> furthered analyzed: >>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >>> event-delivery. >>> (7) L0 decides to forward the VMExit to L1 for further handling. >>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >>> LAPIC's IRR is not examined and therefore the IPI is still not delivered >>> into L1! >>> >>> Signed-off-by: Liran Alon <liran.alon@oracle.com> >>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> --- >>> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >>> 1 file changed, 20 insertions(+), 13 deletions(-) >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index a4b5a013064b..63359ab0e798 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> int r; >>> >>> /* try to reinject previous events if any */ >>> - if (vcpu->arch.exception.injected) { >>> - kvm_x86_ops->queue_exception(vcpu); >>> - return 0; >>> - } >>> >>> + if (vcpu->arch.exception.injected) >>> + kvm_x86_ops->queue_exception(vcpu); >>> /* >>> * NMI/interrupt must not be injected if an exception is >>> * pending, because the latter may have been queued by >>> * handling exit due to NMI/interrupt delivery. >>> */ >>> - if (!vcpu->arch.exception.pending) { >>> - if (vcpu->arch.nmi_injected) { >>> + else if (!vcpu->arch.exception.pending) { >>> + if (vcpu->arch.nmi_injected) >>> kvm_x86_ops->set_nmi(vcpu); >>> - return 0; >>> - } >>> - >>> - if (vcpu->arch.interrupt.injected) { >>> + else if (vcpu->arch.interrupt.injected) >>> kvm_x86_ops->set_irq(vcpu); >>> - return 0; >>> - } >>> } >>> >>> + /* >>> + * Call check_nested_events() even if we reinjected a previous event >>> + * in order for caller to determine if it should require immediate-exit >>> + * from L2 to L1 due to pending L1 events which require exit >>> + * from L2 to L1. 
>>> + */ >>> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >>> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >>> if (r != 0) >>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> vcpu->arch.exception.has_error_code, >>> vcpu->arch.exception.error_code); >>> >>> + WARN_ON_ONCE(vcpu->arch.exception.injected); >>> vcpu->arch.exception.pending = false; >>> vcpu->arch.exception.injected = true; >>> >>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> } >>> >>> kvm_x86_ops->queue_exception(vcpu); >>> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { >>> + } >>> + >>> + /* Don't consider new event if we re-injected an event */ >>> + if (kvm_event_needs_reinjection(vcpu)) >>> + return 0; >>> + >>> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>> + kvm_x86_ops->smi_allowed(vcpu)) { >>> vcpu->arch.smi_pending = false; >>> enter_smm(vcpu); >>> } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { >>> -- >>> 1.9.1 >>>
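The immediate-exit mechanism referred to above looks roughly like this in vcpu_enter_guest() of that era (a simplified sketch, not a verbatim excerpt):

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		...
		/* A nonzero return means an L2->L1 exit is pending but must
		 * wait until the re-injected event has been delivered. */
		if (inject_pending_event(vcpu, req_int_win) != 0)
			req_immediate_exit = true;
		...
	}
	...
	if (req_immediate_exit) {
		/* Re-evaluate events on the next iteration ... */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		/* ... and send ourselves an IPI so the CPU leaves guest mode
		 * right after VM-entry completes the event injection. */
		smp_send_reschedule(vcpu->cpu);
	}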
Thanks for the explanation (and for your patience)! Reviewed-by: Jim Mattson <jmattson@google.com> On Tue, Nov 28, 2017 at 11:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote: > > > On 28/11/17 20:26, Jim Mattson wrote: >> >> One other thing to note is that there is no place to put the >> interrupted IDT vectoring information for L2 virtual interrupt 210 >> when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt >> 252. I see that vmcs12_save_pending_event() shoves this information >> into the VMCS12 IDT-vectoring information field, but it should not do >> so when the VMCS12 exit reason is "exception or non-maskable >> interrupt." >> >> From the SDM, volume 3, section 27.2.3: >> >> Section 24.9.3 defined fields containing information for VM exits that >> occur while delivering an event through the IDT and as a result of any >> of the following cases: >> o A fault occurs during event delivery and causes a VM exit (because >> the bit associated with the fault is set to 1 in the exception >> bitmap). >> o A task switch is invoked through a task gate in the IDT. The VM >> exit occurs due to the task switch only after the initial checks of >> the task switch pass (see Section 25.4.2). >> o Event delivery causes an APIC-access VM exit (see Section 29.4). >> o An EPT violation, EPT misconfiguration, or page-modification >> log-full event that occurs during event delivery. >> >> For all other exit reasons (including "exception or non-maskable >> interrupt"), the valid bit of the IDT-vectoring info field should be >> clear. > > > Interesting. > > In the case described in the commit-message of this patch this is not > relevant as because kvm_event_needs_reinjection()==true, > vmx_check_nested_events() won't really do an exit from L2 to L1 on > EXTERNAL_INTERRUPT in this case. But rather it will return -EBUSY which will > cause vcpu_enter_guest() to request an "immediate-exit" (via self-IPI. See > code in vcpu_enter_guest() regarding req_immediate_exit variable for more > info). This will make guest complete the event-injection of interrupt 210 > but immediately afterwards exit to L0 which will then exit from L2 to L1 on > EXTERNAL_INTERRUPT because of L1 interrupt 252. > > However, you made me notice there is another issue here in > vmx_check_nested_events() handling. If the event-delivery would have caused > a fault that needs to cause exit from L2 to L1, then currently > vmx_check_nested_events() will see that kvm_event_needs_reinjection()==true > and therefore not exit from L2 to L1 even though it should. > So this is just another other bug to fix I think... > > Nice catch. > -Liran > > >> >> On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote: >>> >>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> >>> wrote: >>>> >>>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >>>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >>>> reinject before next resume of L2. >>>> >>>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >>>> to the L1 vCPU which currently runs L2 and exited to L0. >>>> >>>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >>>> it will note that a previous event was re-injected to L2 (by >>>> IDT-vectoring-info) and therefore won't check if there are pending L1 >>>> events which require exit from L2 to L1. Thus, L0 enters L2 without >>>> immediate VMExit even though there are pending L1 events! 
>>>> >>>> This commit fixes the issue by making sure to check for L1 pending >>>> events even if a previous event was reinjected to L2 and bailing out >>>> from inject_pending_event() before evaluating a new pending event in >>>> case an event was already reinjected. >>>> >>>> The bug was observed by the following setup: >>>> * L0 is a 64CPU machine which runs KVM. >>>> * L1 is a 16CPU machine which runs KVM. >>>> * L0 & L1 runs with APICv disabled. >>>> (Also reproduced with APICv enabled but easier to analyze below info >>>> with APICv disabled) >>>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >>>> During L2 boot, L1 hangs completely and analyzing the hang reveals that >>>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >>>> that he has sent for another L1 vCPU. And all other L1 vCPUs are >>>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >>>> forever (as L1 runs with kernel-preemption disabled). >>>> >>>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >>>> series of events: >>>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >>>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >>> >>> >>> This looks like an EPT violation for writing a stack page that wasn't >>> present in the vmcs02 shadow EPT tables. I assume that L0 filled in >>> the missing shadow EPT table entry (or entries) and dismissed this EPT >>> violation itself rather than forwarding it to L1, because L1's EPT >>> tables have a valid entry for this L2 guest-physical address. >>> >>>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >>>> vec 252 (Fixed|edge) >>> >>> >>> This is the source of the bug, right here. An interrupt can only be >>> accepted at an instruction boundary, and the virtual CPU is in the >>> midst of IDT-vectoring, which is not at an instruction boundary. The >>> EPT violation from L2 to L0 is an L0 artifact that should be >>> completely invisible to L1/L2. Interrupt 252 should remain pending >>> until we are at the first instruction of the IDT handler for interrupt >>> 210. >>> >>> To be more precise, if interrupt 210 was delivered via VM-entry event >>> injection on L1 VM-entry to L2, then we need to: >>> a) complete the event injection (i.e. vector through the L2 IDT for >>> vector 210), >>> b) handle virtual VMX-preemption timer expiration during VM-entry (if >>> any), >>> c) handle NMI-window exiting events (if any), >>> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the >>> L2 IDT, depending on the NMI-exiting control), >>> e) handle interrupt-window exiting events, >>> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by >>> vectoring through the L2 IDT, depending on the external-iinterrupt >>> exiting control). >>> ... >>> >>> Only when we get to step (f) can we accept a new IRQ from L0's local >>> APIC. 
>>> >>>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >>>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >>>> rip 0xffffe00069202690 info 83 0 >>>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >>>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >>>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >>>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >>>> ext_int: 0x00000000 ext_int_err: 0x00000000 >>>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>>> >>>> Which can be analyzed as follows: >>>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >>>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >>>> a pending-interrupt of 0xd2. >>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >>>> vCPU 15. This will set relevant bit in LAPIC's IRR and set >>>> KVM_REQ_EVENT. >>>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >>>> notes that interrupt 0xd2 was reinjected and therefore calls >>>> vmx_inject_irq() and returns. Without checking for pending L1 events! >>>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >>>> before calling inject_pending_event(). >>>> (4) L0 resumes L2 without immediate-exit even though there is a pending >>>> L1 event (The IPI pending in LAPIC's IRR). >>>> >>>> We have already reached the buggy scenario but events could be >>>> furthered analyzed: >>>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >>>> event-delivery. >>>> (7) L0 decides to forward the VMExit to L1 for further handling. >>>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >>>> LAPIC's IRR is not examined and therefore the IPI is still not delivered >>>> into L1! >>>> >>>> Signed-off-by: Liran Alon <liran.alon@oracle.com> >>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>>> --- >>>> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >>>> 1 file changed, 20 insertions(+), 13 deletions(-) >>>> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>> index a4b5a013064b..63359ab0e798 100644 >>>> --- a/arch/x86/kvm/x86.c >>>> +++ b/arch/x86/kvm/x86.c >>>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> int r; >>>> >>>> /* try to reinject previous events if any */ >>>> - if (vcpu->arch.exception.injected) { >>>> - kvm_x86_ops->queue_exception(vcpu); >>>> - return 0; >>>> - } >>>> >>>> + if (vcpu->arch.exception.injected) >>>> + kvm_x86_ops->queue_exception(vcpu); >>>> /* >>>> * NMI/interrupt must not be injected if an exception is >>>> * pending, because the latter may have been queued by >>>> * handling exit due to NMI/interrupt delivery. 
>>>> */ >>>> - if (!vcpu->arch.exception.pending) { >>>> - if (vcpu->arch.nmi_injected) { >>>> + else if (!vcpu->arch.exception.pending) { >>>> + if (vcpu->arch.nmi_injected) >>>> kvm_x86_ops->set_nmi(vcpu); >>>> - return 0; >>>> - } >>>> - >>>> - if (vcpu->arch.interrupt.injected) { >>>> + else if (vcpu->arch.interrupt.injected) >>>> kvm_x86_ops->set_irq(vcpu); >>>> - return 0; >>>> - } >>>> } >>>> >>>> + /* >>>> + * Call check_nested_events() even if we reinjected a previous >>>> event >>>> + * in order for caller to determine if it should require >>>> immediate-exit >>>> + * from L2 to L1 due to pending L1 events which require exit >>>> + * from L2 to L1. >>>> + */ >>>> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >>>> r = kvm_x86_ops->check_nested_events(vcpu, >>>> req_int_win); >>>> if (r != 0) >>>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> >>>> vcpu->arch.exception.has_error_code, >>>> >>>> vcpu->arch.exception.error_code); >>>> >>>> + WARN_ON_ONCE(vcpu->arch.exception.injected); >>>> vcpu->arch.exception.pending = false; >>>> vcpu->arch.exception.injected = true; >>>> >>>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu >>>> *vcpu, bool req_int_win) >>>> } >>>> >>>> kvm_x86_ops->queue_exception(vcpu); >>>> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>>> kvm_x86_ops->smi_allowed(vcpu)) { >>>> + } >>>> + >>>> + /* Don't consider new event if we re-injected an event */ >>>> + if (kvm_event_needs_reinjection(vcpu)) >>>> + return 0; >>>> + >>>> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>>> + kvm_x86_ops->smi_allowed(vcpu)) { >>>> vcpu->arch.smi_pending = false; >>>> enter_smm(vcpu); >>>> } else if (vcpu->arch.nmi_pending && >>>> kvm_x86_ops->nmi_allowed(vcpu)) { >>>> -- >>>> 1.9.1 >>>> >
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a4b5a013064b..63359ab0e798 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) int r; /* try to reinject previous events if any */ - if (vcpu->arch.exception.injected) { - kvm_x86_ops->queue_exception(vcpu); - return 0; - } + if (vcpu->arch.exception.injected) + kvm_x86_ops->queue_exception(vcpu); /* * NMI/interrupt must not be injected if an exception is * pending, because the latter may have been queued by * handling exit due to NMI/interrupt delivery. */ - if (!vcpu->arch.exception.pending) { - if (vcpu->arch.nmi_injected) { + else if (!vcpu->arch.exception.pending) { + if (vcpu->arch.nmi_injected) kvm_x86_ops->set_nmi(vcpu); - return 0; - } - - if (vcpu->arch.interrupt.injected) { + else if (vcpu->arch.interrupt.injected) kvm_x86_ops->set_irq(vcpu); - return 0; - } } + /* + * Call check_nested_events() even if we reinjected a previous event + * in order for caller to determine if it should require immediate-exit + * from L2 to L1 due to pending L1 events which require exit + * from L2 to L1. + */ if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); if (r != 0) @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) vcpu->arch.exception.has_error_code, vcpu->arch.exception.error_code); + WARN_ON_ONCE(vcpu->arch.exception.injected); vcpu->arch.exception.pending = false; vcpu->arch.exception.injected = true; @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) } kvm_x86_ops->queue_exception(vcpu); - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { + } + + /* Don't consider new event if we re-injected an event */ + if (kvm_event_needs_reinjection(vcpu)) + return 0; + + if (vcpu->arch.smi_pending && !is_smm(vcpu) && + kvm_x86_ops->smi_allowed(vcpu)) { vcpu->arch.smi_pending = false; enter_smm(vcpu); } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
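For context, the kvm_event_needs_reinjection() check used by the patch is roughly the following (paraphrased; given the pending/injected split discussed earlier in this thread, the exact field names at this point in the series may differ):

static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
{
	/*
	 * True if an interrupted exception, NMI or IRQ still has to be
	 * re-delivered to the guest before any new event may be considered.
	 */
	return vcpu->arch.exception.injected || vcpu->arch.interrupt.injected ||
		vcpu->arch.nmi_injected;
}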