diff mbox

kvm-unit-test fail for split irqchip

Message ID 20160913190145.GE15680@potion (mailing list archive)
State New, archived
Headers show

Commit Message

Radim Krčmář Sept. 13, 2016, 7:01 p.m. UTC
2016-09-13 19:06+0800, Wanpeng Li:
> Add -kernel_irqchip=split
> ./x86-run x86/eventinj.flat
> 
> qemu-system-x86_64 -enable-kvm -machine kernel_irqchip=split -cpu host
> -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc
> none -serial stdio -device pci-testdev -kernel x86/eventinj.flat
> enabling apic
> paging enabled
> cr0 = 80010011
> cr3 = 7fff000
> cr4 = 20
> Sending vec 33 and 62 and mask one with TPR
> irq1 running
> irq1 running
> After 33/62 TPR test
> FAIL: TPR
> irq0 running
> irq0 running
> 
> Both irq1 and irq0 are executing twice.
> 
>  qemu-system-x86-22794 [001] d..2 34591.708476: kvm_entry: vcpu 0
>  qemu-system-x86-22794 [001] ...1 34591.708478: kvm_exit: reason
> MSR_WRITE rip 0x401f33 info 0 0
>  qemu-system-x86-22794 [001] ...1 34591.708478: kvm_apic: apic_write
> APIC_EOI = 0x0
>  qemu-system-x86-22794 [001] ...1 34591.708479: kvm_eoi: apicid 0 vector 62
>  qemu-system-x86-22794 [001] ...1 34591.708479: kvm_msr: msr_write 80b = 0x0
>  qemu-system-x86-22794 [001] d..2 34591.708480: kvm_entry: vcpu 0
>  qemu-system-x86-22794 [001] ...1 34591.708482: kvm_exit: reason
> PENDING_INTERRUPT rip 0x401f35 info 0 0
>  qemu-system-x86-22794 [001] ...1 34591.708482: kvm_userspace_exit:
> reason KVM_EXIT_IRQ_WINDOW_OPEN (7)
>  qemu-system-x86-22794 [001] ...1 34591.708491: kvm_inj_virq: irq 62
>  qemu-system-x86-22794 [001] d..2 34591.708492: kvm_entry: vcpu 0
>  qemu-system-x86-22794 [001] ...1 34591.708493: kvm_exit: reason
> IO_INSTRUCTION rip 0x4016ec info 3fd0008 0
> 
> From the trace we can see there is an interrupt window exit after the
> first interrupt EOI(irq 62), and the same irq(62) is injected
> duplicately after the interrupt window.
> 
> The bug can disappear if kernel_irqchip is on or -x2apic, the virtual
> x2apic mode will not be set due to commit (8d14695f9542 x86, apicv:
> add virtual x2apic support), so that tpr shadow in the x2apic doesn't
> work and wrmsr TPR register will trigger vmexit, and then kvmvapic
> will be used to optimize flexpriority=N or AMD. The
> report_trp_access() which is called in kvm_lapic_reg_write() will
> trigger a userspace exit.
> 
> TPR report access callbacks in qemu, kvm_handle_tpr_access() ->
> apic_handle_tpr_access_report() -> vapic_report_tpr_access() ->
> cpu_synchronize_state() will get apic register states from kvm.
> 
> Later, kvm_arch_pre_run -> cpu_get_pic_interrupt(if there is a pic
> interrupt) -> apic_get_interrupt,  it is a pic interrupt, however it
> gets the stale irq from apic register sync by report tpr access and
> KVM_INTERRUPT the second duplicate interrupt.
> 
> Paolo pointed out it is not the TPR associated bug, and we should
> figure out why there is an interrupt window exit after the first EOI.

Yeah, seems like QEMU bug.  QEMU does KVM_INTERRUPT(62) ioctl after KVM
exits with KVM_EXIT_IRQ_WINDOW_OPEN, which QEMU requested while the
guest was printing.  The printing calls

  serial_update_irq() -> qemu_irq_lower() -> qemu_set_irq() ->
  gsi_handler() -> qemu_set_irq() -> pic_irq_request() ->
  apic_deliver_pic_intr() -> kvm_handle_interrupt()

kvm_handle_interrupt() does

  interrupt_request |= CPU_INTERRUPT_HARD

which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
that function uses stale information from APIC and injects 62 again.
If we synchronized the APIC, then the test would #GP, because there
would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
would return 15, thinking it was spurious.

I think the bug starts in pic_irq_request(), which should not touch
LAPIC.  The patch below makes it work (just the second hunk is
sufficient), but it's still far from sane code ...



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Paolo Bonzini Sept. 13, 2016, 8:43 p.m. UTC | #1
On 13/09/2016 21:01, Radim Krcmar wrote:
> kvm_handle_interrupt() does
> 
>   interrupt_request |= CPU_INTERRUPT_HARD
> 
> which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
> that function uses stale information from APIC and injects 62 again.
> If we synchronized the APIC, then the test would #GP, because there
> would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
> would return 15, thinking it was spurious.
> 
> I think the bug starts in pic_irq_request(), which should not touch
> LAPIC.  The patch below makes it work (just the second hunk is
> sufficient), but it's still far from sane code ...

This makes sense.  Most of the functions exported by hw/intc/apic.c
should only be used with a userspace APIC:

0000000000000b70 T apic_accept_pic_intr
00000000000010f0 T apic_deliver_irq
00000000000011e0 T apic_deliver_pic_intr
0000000000001310 T apic_get_interrupt
0000000000001270 T apic_poll_irq
0000000000000a40 T apic_sipi

The patch is okay, though for consistency with other code I'd use
!kvm_irqchip_in_kernel() rather than !kvm_irqchip_is_split().

Wanpeng, can you do that, and change hw/intc/apic.c to use a new casting
macro

    APICCommonState *s = APIC(dev);

instead of APIC_COMMON?

Thanks,

Paolo


> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 47593b741a5b..6983e9f13813 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -161,13 +161,16 @@ int cpu_get_pic_interrupt(CPUX86State *env)
>      X86CPU *cpu = x86_env_get_cpu(env);
>      int intno;
>  
> -    intno = apic_get_interrupt(cpu->apic_state);
> -    if (intno >= 0) {
> -        return intno;
> -    }
> -    /* read the irq from the PIC */
> -    if (!apic_accept_pic_intr(cpu->apic_state)) {
> -        return -1;
> +    if (!kvm_irqchip_is_split()) {
> +        /* XXX: why query APIC at all? */
> +        intno = apic_get_interrupt(cpu->apic_state);
> +        if (intno >= 0) {
> +            return intno;
> +        }
> +        /* read the irq from the PIC */
> +        if (!apic_accept_pic_intr(cpu->apic_state)) {
> +            return -1;
> +        }
>      }
>  
>      intno = pic_read_irq(isa_pic);
> @@ -180,7 +183,7 @@ static void pic_irq_request(void *opaque, int irq, int level)
>      X86CPU *cpu = X86_CPU(cs);
>  
>      DPRINTF("pic_irqs: %s irq %d\n", level? "raise" : "lower", irq);
> -    if (cpu->apic_state) {
> +    if (cpu->apic_state && !kvm_irqchip_is_split()) {
>          CPU_FOREACH(cs) {
>              cpu = X86_CPU(cs);
>              if (apic_accept_pic_intr(cpu->apic_state)) {
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wanpeng Li Sept. 14, 2016, 3:52 a.m. UTC | #2
2016-09-14 3:01 GMT+08:00 Radim Krcmar <rkrcmar@redhat.com>:
> 2016-09-13 19:06+0800, Wanpeng Li:
>> Add -kernel_irqchip=split
>> ./x86-run x86/eventinj.flat
>>
>> qemu-system-x86_64 -enable-kvm -machine kernel_irqchip=split -cpu host
>> -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc
>> none -serial stdio -device pci-testdev -kernel x86/eventinj.flat
>> enabling apic
>> paging enabled
>> cr0 = 80010011
>> cr3 = 7fff000
>> cr4 = 20
>> Sending vec 33 and 62 and mask one with TPR
>> irq1 running
>> irq1 running
>> After 33/62 TPR test
>> FAIL: TPR
>> irq0 running
>> irq0 running
>>
>> Both irq1 and irq0 are executing twice.
>>
>>  qemu-system-x86-22794 [001] d..2 34591.708476: kvm_entry: vcpu 0
>>  qemu-system-x86-22794 [001] ...1 34591.708478: kvm_exit: reason
>> MSR_WRITE rip 0x401f33 info 0 0
>>  qemu-system-x86-22794 [001] ...1 34591.708478: kvm_apic: apic_write
>> APIC_EOI = 0x0
>>  qemu-system-x86-22794 [001] ...1 34591.708479: kvm_eoi: apicid 0 vector 62
>>  qemu-system-x86-22794 [001] ...1 34591.708479: kvm_msr: msr_write 80b = 0x0
>>  qemu-system-x86-22794 [001] d..2 34591.708480: kvm_entry: vcpu 0
>>  qemu-system-x86-22794 [001] ...1 34591.708482: kvm_exit: reason
>> PENDING_INTERRUPT rip 0x401f35 info 0 0
>>  qemu-system-x86-22794 [001] ...1 34591.708482: kvm_userspace_exit:
>> reason KVM_EXIT_IRQ_WINDOW_OPEN (7)
>>  qemu-system-x86-22794 [001] ...1 34591.708491: kvm_inj_virq: irq 62
>>  qemu-system-x86-22794 [001] d..2 34591.708492: kvm_entry: vcpu 0
>>  qemu-system-x86-22794 [001] ...1 34591.708493: kvm_exit: reason
>> IO_INSTRUCTION rip 0x4016ec info 3fd0008 0
>>
>> From the trace we can see there is an interrupt window exit after the
>> first interrupt EOI(irq 62), and the same irq(62) is injected
>> duplicately after the interrupt window.
>>
>> The bug can disappear if kernel_irqchip is on or -x2apic, the virtual
>> x2apic mode will not be set due to commit (8d14695f9542 x86, apicv:
>> add virtual x2apic support), so that tpr shadow in the x2apic doesn't
>> work and wrmsr TPR register will trigger vmexit, and then kvmvapic
>> will be used to optimize flexpriority=N or AMD. The
>> report_trp_access() which is called in kvm_lapic_reg_write() will
>> trigger a userspace exit.
>>
>> TPR report access callbacks in qemu, kvm_handle_tpr_access() ->
>> apic_handle_tpr_access_report() -> vapic_report_tpr_access() ->
>> cpu_synchronize_state() will get apic register states from kvm.
>>
>> Later, kvm_arch_pre_run -> cpu_get_pic_interrupt(if there is a pic
>> interrupt) -> apic_get_interrupt,  it is a pic interrupt, however it
>> gets the stale irq from apic register sync by report tpr access and
>> KVM_INTERRUPT the second duplicate interrupt.
>>
>> Paolo pointed out it is not the TPR associated bug, and we should
>> figure out why there is an interrupt window exit after the first EOI.
>
> Yeah, seems like QEMU bug.  QEMU does KVM_INTERRUPT(62) ioctl after KVM
> exits with KVM_EXIT_IRQ_WINDOW_OPEN, which QEMU requested while the
> guest was printing.  The printing calls
>
>   serial_update_irq() -> qemu_irq_lower() -> qemu_set_irq() ->
>   gsi_handler() -> qemu_set_irq() -> pic_irq_request() ->
>   apic_deliver_pic_intr() -> kvm_handle_interrupt()
>
> kvm_handle_interrupt() does
>
>   interrupt_request |= CPU_INTERRUPT_HARD
>
> which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
> that function uses stale information from APIC and injects 62 again.
> If we synchronized the APIC, then the test would #GP, because there
> would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
> would return 15, thinking it was spurious.
>
> I think the bug starts in pic_irq_request(), which should not touch
> LAPIC.  The patch below makes it work (just the second hunk is
> sufficient), but it's still far from sane code ...

Thanks for the analysis, Radim! :)

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wanpeng Li Sept. 14, 2016, 3:57 a.m. UTC | #3
2016-09-14 4:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 13/09/2016 21:01, Radim Krcmar wrote:
>> kvm_handle_interrupt() does
>>
>>   interrupt_request |= CPU_INTERRUPT_HARD
>>
>> which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
>> that function uses stale information from APIC and injects 62 again.
>> If we synchronized the APIC, then the test would #GP, because there
>> would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
>> would return 15, thinking it was spurious.
>>
>> I think the bug starts in pic_irq_request(), which should not touch
>> LAPIC.  The patch below makes it work (just the second hunk is
>> sufficient), but it's still far from sane code ...
>
> This makes sense.  Most of the functions exported by hw/intc/apic.c
> should only be used with a userspace APIC:
>
> 0000000000000b70 T apic_accept_pic_intr
> 00000000000010f0 T apic_deliver_irq
> 00000000000011e0 T apic_deliver_pic_intr
> 0000000000001310 T apic_get_interrupt
> 0000000000001270 T apic_poll_irq
> 0000000000000a40 T apic_sipi
>
> The patch is okay, though for consistency with other code I'd use
> !kvm_irqchip_in_kernel() rather than !kvm_irqchip_is_split().
>
> Wanpeng, can you do that,

Yeah, I just sent out a patch to fix the bug. Thanks for the long
discussion with me and thanks Radim's proposal. :)

> and change hw/intc/apic.c to use a new casting
> macro
>
>     APICCommonState *s = APIC(dev);
>
> instead of APIC_COMMON?
>

I'm not familiar with QOM too much, what APIC macro do you like?

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Sept. 14, 2016, 9:42 a.m. UTC | #4
On 14/09/2016 05:57, Wanpeng Li wrote:
> 2016-09-14 4:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>
>>
>> On 13/09/2016 21:01, Radim Krcmar wrote:
>>> kvm_handle_interrupt() does
>>>
>>>   interrupt_request |= CPU_INTERRUPT_HARD
>>>
>>> which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
>>> that function uses stale information from APIC and injects 62 again.
>>> If we synchronized the APIC, then the test would #GP, because there
>>> would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
>>> would return 15, thinking it was spurious.
>>>
>>> I think the bug starts in pic_irq_request(), which should not touch
>>> LAPIC.  The patch below makes it work (just the second hunk is
>>> sufficient), but it's still far from sane code ...
>>
>> This makes sense.  Most of the functions exported by hw/intc/apic.c
>> should only be used with a userspace APIC:
>>
>> 0000000000000b70 T apic_accept_pic_intr
>> 00000000000010f0 T apic_deliver_irq
>> 00000000000011e0 T apic_deliver_pic_intr
>> 0000000000001310 T apic_get_interrupt
>> 0000000000001270 T apic_poll_irq
>> 0000000000000a40 T apic_sipi
>>
>> The patch is okay, though for consistency with other code I'd use
>> !kvm_irqchip_in_kernel() rather than !kvm_irqchip_is_split().
>>
>> Wanpeng, can you do that,
> 
> Yeah, I just sent out a patch to fix the bug. Thanks for the long
> discussion with me and thanks Radim's proposal. :)
> 
>> and change hw/intc/apic.c to use a new casting
>> macro
>>
>>     APICCommonState *s = APIC(dev);
>>
>> instead of APIC_COMMON?
>>
> 
> I'm not familiar with QOM too much, what APIC macro do you like?

The macro would be defined like

#define TYPE_APIC "apic"
#define APIC(obj) \
     OBJECT_CHECK(APICCommonState, (obj), TYPE_APIC)

With this change and without your other patch, you should get an
assertion failure in eventinj.flat (which would have pointed immediately
at the bug!).

Thanks,

Paolo

> Regards,
> Wanpeng Li
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wanpeng Li Sept. 15, 2016, 12:58 a.m. UTC | #5
2016-09-14 17:42 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 14/09/2016 05:57, Wanpeng Li wrote:
>> 2016-09-14 4:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>
>>> On 13/09/2016 21:01, Radim Krcmar wrote:
>>>> kvm_handle_interrupt() does
>>>>
>>>>   interrupt_request |= CPU_INTERRUPT_HARD
>>>>
>>>> which later calls cpu_get_pic_interrupt() in kvm_arch_pre_run(), but
>>>> that function uses stale information from APIC and injects 62 again.
>>>> If we synchronized the APIC, then the test would #GP, because there
>>>> would be no injectable interrupt in LAPIC or PIC, so pic_read_irq()
>>>> would return 15, thinking it was spurious.
>>>>
>>>> I think the bug starts in pic_irq_request(), which should not touch
>>>> LAPIC.  The patch below makes it work (just the second hunk is
>>>> sufficient), but it's still far from sane code ...
>>>
>>> This makes sense.  Most of the functions exported by hw/intc/apic.c
>>> should only be used with a userspace APIC:
>>>
>>> 0000000000000b70 T apic_accept_pic_intr
>>> 00000000000010f0 T apic_deliver_irq
>>> 00000000000011e0 T apic_deliver_pic_intr
>>> 0000000000001310 T apic_get_interrupt
>>> 0000000000001270 T apic_poll_irq
>>> 0000000000000a40 T apic_sipi
>>>
>>> The patch is okay, though for consistency with other code I'd use
>>> !kvm_irqchip_in_kernel() rather than !kvm_irqchip_is_split().
>>>
>>> Wanpeng, can you do that,
>>
>> Yeah, I just sent out a patch to fix the bug. Thanks for the long
>> discussion with me and thanks Radim's proposal. :)
>>
>>> and change hw/intc/apic.c to use a new casting
>>> macro
>>>
>>>     APICCommonState *s = APIC(dev);
>>>
>>> instead of APIC_COMMON?
>>>
>>
>> I'm not familiar with QOM too much, what APIC macro do you like?
>
> The macro would be defined like
>
> #define TYPE_APIC "apic"
> #define APIC(obj) \
>      OBJECT_CHECK(APICCommonState, (obj), TYPE_APIC)
>
> With this change and without your other patch, you should get an
> assertion failure in eventinj.flat (which would have pointed immediately
> at the bug!).

Good point, I just sent out the patch, it catches the bug as you expected. :)

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 47593b741a5b..6983e9f13813 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -161,13 +161,16 @@  int cpu_get_pic_interrupt(CPUX86State *env)
     X86CPU *cpu = x86_env_get_cpu(env);
     int intno;
 
-    intno = apic_get_interrupt(cpu->apic_state);
-    if (intno >= 0) {
-        return intno;
-    }
-    /* read the irq from the PIC */
-    if (!apic_accept_pic_intr(cpu->apic_state)) {
-        return -1;
+    if (!kvm_irqchip_is_split()) {
+        /* XXX: why query APIC at all? */
+        intno = apic_get_interrupt(cpu->apic_state);
+        if (intno >= 0) {
+            return intno;
+        }
+        /* read the irq from the PIC */
+        if (!apic_accept_pic_intr(cpu->apic_state)) {
+            return -1;
+        }
     }
 
     intno = pic_read_irq(isa_pic);
@@ -180,7 +183,7 @@  static void pic_irq_request(void *opaque, int irq, int level)
     X86CPU *cpu = X86_CPU(cs);
 
     DPRINTF("pic_irqs: %s irq %d\n", level? "raise" : "lower", irq);
-    if (cpu->apic_state) {
+    if (cpu->apic_state && !kvm_irqchip_is_split()) {
         CPU_FOREACH(cs) {
             cpu = X86_CPU(cs);
             if (apic_accept_pic_intr(cpu->apic_state)) {