diff mbox

vfio: Fix regression in MSI routing configuration

Message ID 1473919908-20653-1-git-send-email-david@gibson.dropbear.id.au (mailing list archive)
State New, archived
Headers show

Commit Message

David Gibson Sept. 15, 2016, 6:11 a.m. UTC
d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup
of kvmchip routing configuration, that was mostly intended for x86.
However, it also contains a subtle change in behaviour which breaks EEH[1]
error recovery on certain VFIO passthrough devices on spapr guests.  So far
it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be
other hardware with the same problem.  It's also possible there could be
circumstances where it causes a bug on x86 as well, though I don't know of
any obvious candidates.

Prior to d1f6af6, both vfio_msix_vector_do_use() and
vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this
as the "dummy" vector used to make the host hardware state sync with the
guest expected hardware state in terms of MSI configuration.

Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op,
meaning the dummy irq would always be delivered via qemu. d1f6af6 changed
vfio_add_kvm_msi_virq() so it takes a vector number instead of the msg
parameter, and determines the correct message itself.  The test for !msg
was removed, and not replaced with anything there or in the caller.

With an spapr guest which has a VFIO device, if an EEH error occurs on the
host hardware, then the device will be isolated then reset.  This is a
combination of host and guest action, mediated by some EEH related
hypercalls.  I haven't fully traced the mechanics, but somehow installing
the kvm irqchip route for the dummy irq on the BCM5719 means that after EEH
reset and recovery, at least some irqs are no longer delivered to the
guest.

In particular, the guest never gets the link up event, and so the NIC is
effectively dead.

[1] EEH (Enhanced Error Handling) is an IBM POWER server specific PCI-*
    error reporting and recovery mechanism.  The concept is somewhat
    similar to PCI-E AER, but the details are different.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1373802

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Gavin Shan <gwshan@au1.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/vfio/pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Alex Williamson Sept. 15, 2016, 4:38 p.m. UTC | #1
On Thu, 15 Sep 2016 16:11:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup
> of kvmchip routing configuration, that was mostly intended for x86.
> However, it also contains a subtle change in behaviour which breaks EEH[1]
> error recovery on certain VFIO passthrough devices on spapr guests.  So far
> it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be
> other hardware with the same problem.  It's also possible there could be
> circumstances where it causes a bug on x86 as well, though I don't know of
> any obvious candidates.
> 
> Prior to d1f6af6, both vfio_msix_vector_do_use() and
> vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this
> as the "dummy" vector used to make the host hardware state sync with the
> guest expected hardware state in terms of MSI configuration.
> 
> Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op,
> meaning the dummy irq would always be delivered via qemu. d1f6af6 changed
> vfio_add_kvm_msi_virq() so it takes a vector number instead of the msg
> parameter, and determines the correct message itself.  The test for !msg
> was removed, and not replaced with anything there or in the caller.
> 
> With an spapr guest which has a VFIO device, if an EEH error occurs on the
> host hardware, then the device will be isolated then reset.  This is a
> combination of host and guest action, mediated by some EEH related
> hypercalls.  I haven't fully traced the mechanics, but somehow installing
> the kvm irqchip route for the dummy irq on the BCM5719 means that after EEH
> reset and recovery, at least some irqs are no longer delivered to the
> guest.
> 
> In particular, the guest never gets the link up event, and so the NIC is
> effectively dead.

I suspect what's happening is that we're getting a virq setup with
something bogus about the state and then we reuse that bogus state when
the first vector is actually configured and it probably never works.
Anyway, I'm concerned that d1f6af6 is going to cause a regression with
b0223e29afdc ("vfio-pci: Make host MSI-X enable track guest"), which is
where that dummy NULL msg call comes from, so I'm going to mark this for
stable.  Thanks!

Alex

> 
> [1] EEH (Enhanced Error Handling) is an IBM POWER server specific PCI-*
>     error reporting and recovery mechanism.  The concept is somewhat
>     similar to PCI-E AER, but the details are different.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1373802
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Gavin Shan <gwshan@au1.ibm.com>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/vfio/pci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7bfa17c..a5a620a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -496,7 +496,9 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>              vfio_update_kvm_msi_virq(vector, *msg, pdev);
>          }
>      } else {
> -        vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> +        if (msg) {
> +            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> +        }
>      }
>  
>      /*
Peter Xu Sept. 18, 2016, 5:33 a.m. UTC | #2
Good to know that we have a solution. :)

I think this is a good fix, however I still do not understand why this
is happening... Please see below comment.

On Thu, Sep 15, 2016 at 04:11:48PM +1000, David Gibson wrote:
> d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup
> of kvmchip routing configuration, that was mostly intended for x86.
> However, it also contains a subtle change in behaviour which breaks EEH[1]
> error recovery on certain VFIO passthrough devices on spapr guests.  So far
> it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be
> other hardware with the same problem.  It's also possible there could be
> circumstances where it causes a bug on x86 as well, though I don't know of
> any obvious candidates.
> 
> Prior to d1f6af6, both vfio_msix_vector_do_use() and
> vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this
> as the "dummy" vector used to make the host hardware state sync with the
> guest expected hardware state in terms of MSI configuration.
> 
> Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op,
> meaning the dummy irq would always be delivered via qemu.

AFAICT, QEMU is not delivering that *dummy* IRQ as well. IIUC here the
dummy IRQ refers to the one in vfio_msix_enable():

    vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL);
    vfio_msix_vector_release(&vdev->pdev, 0);

Here we registered this dummy one to sync up the guest and host
hardware state for some reason. However, the handler is NULL here,
means that even QEMU won't handle it if it happens. And the only
difference is that during this extremely small period, kvm will handle
the interrupt with an uninitialized MSI message, but assuming that
would never happen since no one should start to use MSI yet. After
release(), interrupts will be dropped for all cases.

So looks like it is not related to "whether QEMU or KVM will handle
the interrupt", but something else.

Generally speaking, the process of registering the IRQ should be
something like (using QEMU with d1f6af6 change):

1. vfio_msix_enable(): when MSIX is enabled. this will trigger
   vfio_add_kvm_msi_virq(), but quickly it is unregistered (virq will
   be there, but VFIO will assign the VFIO handler to
   vector->interrupt, rather than vector->kvm_interrupt, so that virq
   won't take effect).

2. vfio_msix_vector_use(): when an MSIX entry is finally used and
   setup. this will trigger vfio_update_kvm_msi_virq(), to update the
   interrupt information to the "real one".

IMHO we should always receive interrupts after step (2). And as we
have traced (with David), step (2) was updating the correct interrupt
information even with commit d1f6af6. But I think I might be wrong (or
say, the above assumption is not correct), because commit d1f6af6
indeed caused problem with EHH and this specific hardware. I am just
do not know why it is triggering the issue. And that's why I want to
know whether there is anything special with that NIC's driver.

Again, this is totally a discussion only, and I am totally agree with
current change to fix the issue. Just in case someone knows the reason
behind this.

Thanks,

-- peterx
David Gibson Sept. 20, 2016, 5:57 a.m. UTC | #3
On Sun, Sep 18, 2016 at 01:33:27PM +0800, Peter Xu wrote:
> Good to know that we have a solution. :)
> 
> I think this is a good fix, however I still do not understand why this
> is happening... Please see below comment.
> 
> On Thu, Sep 15, 2016 at 04:11:48PM +1000, David Gibson wrote:
> > d1f6af6 "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup
> > of kvmchip routing configuration, that was mostly intended for x86.
> > However, it also contains a subtle change in behaviour which breaks EEH[1]
> > error recovery on certain VFIO passthrough devices on spapr guests.  So far
> > it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be
> > other hardware with the same problem.  It's also possible there could be
> > circumstances where it causes a bug on x86 as well, though I don't know of
> > any obvious candidates.
> > 
> > Prior to d1f6af6, both vfio_msix_vector_do_use() and
> > vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this
> > as the "dummy" vector used to make the host hardware state sync with the
> > guest expected hardware state in terms of MSI configuration.
> > 
> > Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op,
> > meaning the dummy irq would always be delivered via qemu.
> 
> AFAICT, QEMU is not delivering that *dummy* IRQ as well. IIUC here the
> dummy IRQ refers to the one in vfio_msix_enable():
> 
>     vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL);
>     vfio_msix_vector_release(&vdev->pdev, 0);
> 
> Here we registered this dummy one to sync up the guest and host
> hardware state for some reason. However, the handler is NULL here,
> means that even QEMU won't handle it if it happens. And the only
> difference is that during this extremely small period, kvm will handle
> the interrupt with an uninitialized MSI message, but assuming that
> would never happen since no one should start to use MSI yet. After
> release(), interrupts will be dropped for all cases.
> 
> So looks like it is not related to "whether QEMU or KVM will handle
> the interrupt", but something else.
> 
> Generally speaking, the process of registering the IRQ should be
> something like (using QEMU with d1f6af6 change):
> 
> 1. vfio_msix_enable(): when MSIX is enabled. this will trigger
>    vfio_add_kvm_msi_virq(), but quickly it is unregistered (virq will
>    be there, but VFIO will assign the VFIO handler to
>    vector->interrupt, rather than vector->kvm_interrupt, so that virq
>    won't take effect).
> 
> 2. vfio_msix_vector_use(): when an MSIX entry is finally used and
>    setup. this will trigger vfio_update_kvm_msi_virq(), to update the
>    interrupt information to the "real one".
> 
> IMHO we should always receive interrupts after step (2). And as we
> have traced (with David), step (2) was updating the correct interrupt
> information even with commit d1f6af6. But I think I might be wrong (or
> say, the above assumption is not correct), because commit d1f6af6
> indeed caused problem with EHH and this specific hardware. I am just
> do not know why it is triggering the issue. And that's why I want to
> know whether there is anything special with that NIC's driver.

It's quite possible there's something unusual about the driver - this
didn't happen for the other NIC I tried, or for an XHCI.
Unfortunately, I don't know what it would be, and I really have no
idea where one would start looking.

> Again, this is totally a discussion only, and I am totally agree with
> current change to fix the issue. Just in case someone knows the reason
> behind this.
> 
> Thanks,
> 
> -- peterx
>
diff mbox

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7bfa17c..a5a620a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -496,7 +496,9 @@  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
             vfio_update_kvm_msi_virq(vector, *msg, pdev);
         }
     } else {
-        vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+        if (msg) {
+            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+        }
     }
 
     /*