Message ID | 1473919908-20653-1-git-send-email-david@gibson.dropbear.id.au (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 15 Sep 2016 16:11:48 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup > of kvmchip routing configuration, that was mostly intended for x86. > However, it also contains a subtle change in behaviour which breaks EEH[1] > error recovery on certain VFIO passthrough devices on spapr guests. So far > it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be > other hardware with the same problem. It's also possible there could be > circumstances where it causes a bug on x86 as well, though I don't know of > any obvious candidates. > > Prior to d1f6af6, both vfio_msix_vector_do_use() and > vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this > as the "dummy" vector used to make the host hardware state sync with the > guest expected hardware state in terms of MSI configuration. > > Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op, > meaning the dummy irq would always be delivered via qemu. d1f6af6 changed > vfio_add_kvm_msi_virq() so it takes a vector number instead of the msg > parameter, and determines the correct message itself. The test for !msg > was removed, and not replaced with anything there or in the caller. > > With an spapr guest which has a VFIO device, if an EEH error occurs on the > host hardware, then the device will be isolated then reset. This is a > combination of host and guest action, mediated by some EEH related > hypercalls. I haven't fully traced the mechanics, but somehow installing > the kvm irqchip route for the dummy irq on the BCM5719 means that after EEH > reset and recovery, at least some irqs are no longer delivered to the > guest. > > In particular, the guest never gets the link up event, and so the NIC is > effectively dead. I suspect what's happening is that we're getting a virq setup with something bogus about the state and then we reuse that bogus state when the first vector is actually configured and it probably never works. Anyway, I'm concerned that d1f6af6 is going to cause a regression with b0223e29afdc ("vfio-pci: Make host MSI-X enable track guest"), which is where that dummy NULL msg call comes from, so I'm going to mark this for stable. Thanks! Alex > > [1] EEH (Enhanced Error Handling) is an IBM POWER server specific PCI-* > error reporting and recovery mechanism. The concept is somewhat > similar to PCI-E AER, but the details are different. > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1373802 > > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Gavin Shan <gwshan@au1.ibm.com> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > hw/vfio/pci.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index 7bfa17c..a5a620a 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -496,7 +496,9 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, > vfio_update_kvm_msi_virq(vector, *msg, pdev); > } > } else { > - vfio_add_kvm_msi_virq(vdev, vector, nr, true); > + if (msg) { > + vfio_add_kvm_msi_virq(vdev, vector, nr, true); > + } > } > > /*
Good to know that we have a solution. :) I think this is a good fix, however I still do not understand why this is happening... Please see below comment. On Thu, Sep 15, 2016 at 04:11:48PM +1000, David Gibson wrote: > d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup > of kvmchip routing configuration, that was mostly intended for x86. > However, it also contains a subtle change in behaviour which breaks EEH[1] > error recovery on certain VFIO passthrough devices on spapr guests. So far > it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be > other hardware with the same problem. It's also possible there could be > circumstances where it causes a bug on x86 as well, though I don't know of > any obvious candidates. > > Prior to d1f6af6, both vfio_msix_vector_do_use() and > vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this > as the "dummy" vector used to make the host hardware state sync with the > guest expected hardware state in terms of MSI configuration. > > Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op, > meaning the dummy irq would always be delivered via qemu. AFAICT, QEMU is not delivering that *dummy* IRQ as well. IIUC here the dummy IRQ refers to the one in vfio_msix_enable(): vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL); vfio_msix_vector_release(&vdev->pdev, 0); Here we registered this dummy one to sync up the guest and host hardware state for some reason. However, the handler is NULL here, means that even QEMU won't handle it if it happens. And the only difference is that during this extremely small period, kvm will handle the interrupt with an uninitialized MSI message, but assuming that would never happen since no one should start to use MSI yet. After release(), interrupts will be dropped for all cases. So looks like it is not related to "whether QEMU or KVM will handle the interrupt", but something else. Generally speaking, the process of registering the IRQ should be something like (using QEMU with d1f6af6 change): 1. vfio_msix_enable(): when MSIX is enabled. this will trigger vfio_add_kvm_msi_virq(), but quickly it is unregistered (virq will be there, but VFIO will assign the VFIO handler to vector->interrupt, rather than vector->kvm_interrupt, so that virq won't take effect). 2. vfio_msix_vector_use(): when an MSIX entry is finally used and setup. this will trigger vfio_update_kvm_msi_virq(), to update the interrupt information to the "real one". IMHO we should always receive interrupts after step (2). And as we have traced (with David), step (2) was updating the correct interrupt information even with commit d1f6af6. But I think I might be wrong (or say, the above assumption is not correct), because commit d1f6af6 indeed caused problem with EHH and this specific hardware. I am just do not know why it is triggering the issue. And that's why I want to know whether there is anything special with that NIC's driver. Again, this is totally a discussion only, and I am totally agree with current change to fix the issue. Just in case someone knows the reason behind this. Thanks, -- peterx
On Sun, Sep 18, 2016 at 01:33:27PM +0800, Peter Xu wrote: > Good to know that we have a solution. :) > > I think this is a good fix, however I still do not understand why this > is happening... Please see below comment. > > On Thu, Sep 15, 2016 at 04:11:48PM +1000, David Gibson wrote: > > d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup > > of kvmchip routing configuration, that was mostly intended for x86. > > However, it also contains a subtle change in behaviour which breaks EEH[1] > > error recovery on certain VFIO passthrough devices on spapr guests. So far > > it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be > > other hardware with the same problem. It's also possible there could be > > circumstances where it causes a bug on x86 as well, though I don't know of > > any obvious candidates. > > > > Prior to d1f6af6, both vfio_msix_vector_do_use() and > > vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this > > as the "dummy" vector used to make the host hardware state sync with the > > guest expected hardware state in terms of MSI configuration. > > > > Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op, > > meaning the dummy irq would always be delivered via qemu. > > AFAICT, QEMU is not delivering that *dummy* IRQ as well. IIUC here the > dummy IRQ refers to the one in vfio_msix_enable(): > > vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL); > vfio_msix_vector_release(&vdev->pdev, 0); > > Here we registered this dummy one to sync up the guest and host > hardware state for some reason. However, the handler is NULL here, > means that even QEMU won't handle it if it happens. And the only > difference is that during this extremely small period, kvm will handle > the interrupt with an uninitialized MSI message, but assuming that > would never happen since no one should start to use MSI yet. After > release(), interrupts will be dropped for all cases. > > So looks like it is not related to "whether QEMU or KVM will handle > the interrupt", but something else. > > Generally speaking, the process of registering the IRQ should be > something like (using QEMU with d1f6af6 change): > > 1. vfio_msix_enable(): when MSIX is enabled. this will trigger > vfio_add_kvm_msi_virq(), but quickly it is unregistered (virq will > be there, but VFIO will assign the VFIO handler to > vector->interrupt, rather than vector->kvm_interrupt, so that virq > won't take effect). > > 2. vfio_msix_vector_use(): when an MSIX entry is finally used and > setup. this will trigger vfio_update_kvm_msi_virq(), to update the > interrupt information to the "real one". > > IMHO we should always receive interrupts after step (2). And as we > have traced (with David), step (2) was updating the correct interrupt > information even with commit d1f6af6. But I think I might be wrong (or > say, the above assumption is not correct), because commit d1f6af6 > indeed caused problem with EHH and this specific hardware. I am just > do not know why it is triggering the issue. And that's why I want to > know whether there is anything special with that NIC's driver. It's quite possible there's something unusual about the driver - this didn't happen for the other NIC I tried, or for an XHCI. Unfortunately, I don't know what it would be, and I really have no idea where one would start looking. > Again, this is totally a discussion only, and I am totally agree with > current change to fix the issue. Just in case someone knows the reason > behind this. > > Thanks, > > -- peterx >
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 7bfa17c..a5a620a 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -496,7 +496,9 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, vfio_update_kvm_msi_virq(vector, *msg, pdev); } } else { - vfio_add_kvm_msi_virq(vdev, vector, nr, true); + if (msg) { + vfio_add_kvm_msi_virq(vdev, vector, nr, true); + } } /*
d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup of kvmchip routing configuration, that was mostly intended for x86. However, it also contains a subtle change in behaviour which breaks EEH[1] error recovery on certain VFIO passthrough devices on spapr guests. So far it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be other hardware with the same problem. It's also possible there could be circumstances where it causes a bug on x86 as well, though I don't know of any obvious candidates. Prior to d1f6af6, both vfio_msix_vector_do_use() and vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this as the "dummy" vector used to make the host hardware state sync with the guest expected hardware state in terms of MSI configuration. Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op, meaning the dummy irq would always be delivered via qemu. d1f6af6 changed vfio_add_kvm_msi_virq() so it takes a vector number instead of the msg parameter, and determines the correct message itself. The test for !msg was removed, and not replaced with anything there or in the caller. With an spapr guest which has a VFIO device, if an EEH error occurs on the host hardware, then the device will be isolated then reset. This is a combination of host and guest action, mediated by some EEH related hypercalls. I haven't fully traced the mechanics, but somehow installing the kvm irqchip route for the dummy irq on the BCM5719 means that after EEH reset and recovery, at least some irqs are no longer delivered to the guest. In particular, the guest never gets the link up event, and so the NIC is effectively dead. [1] EEH (Enhanced Error Handling) is an IBM POWER server specific PCI-* error reporting and recovery mechanism. The concept is somewhat similar to PCI-E AER, but the details are different. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1373802 Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Gavin Shan <gwshan@au1.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- hw/vfio/pci.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)