Message ID | alpine.DEB.2.10.1701091049130.2866@sstabellini-ThinkPad-X260 (mailing list archive)
---|---
State | New, archived
On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> > <boris.ostrovsky@oracle.com> wrote: >> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> > >>> Do not read a pci device's msi message data to see if a pirq was >> > >>> previously configured for the device's msi/msix, as the old pirq was >> > >>> unmapped and may now be in use by another pci device. The previous >> > >>> pirq should never be re-used; instead a new pirq should always be >> > >>> allocated from the hypervisor. >> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> > >> as we are not reusing them? >> > > >> > > Don't we free the pirq when we unmap it? >> > >> > I think this is actually a bit worse than I initially thought. After >> > looking a bit closer, and I think there's an asymmetry with pirq >> > allocation: >> >> Lets include Stefano, >> >> Thank you for digging in this! This has quite the deja-vu >> feeling as I believe I hit this at some point in the past and >> posted some possible ways of fixing this. But sadly I can't >> find the thread. > > This issue seems to be caused by: > > commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > Date: Wed Dec 1 14:51:44 2010 +0000 > > xen: fix MSI setup and teardown for PV on HVM guests > > which was a fix to a bug: > > This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > trying to enable the same MSI for the second time: the old MSI to pirq > mapping is still valid at this point but xen_hvm_setup_msi_irqs would > try to assign a new pirq anyway. > A simple way to reproduce this bug is to assign an MSI capable network > card to a PV on HVM guest, if the user brings down the corresponding > ethernet interface and up again, Linux would fail to enable MSIs on the > device. > > I don't remember any of the details. From the description of this bug, > it seems that Xen changed behavior in the past few years: before it used > to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > old MSI to pirq mapping is still valid at this point", the pirq couldn't > have been completely freed, then reassigned to somebody else the way it > is described in this email. > > I think we should indentify the changeset or Xen version that introduced > the new behavior. If it is old enough, we might be able to just revert > af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > behavior conditional to the Xen version. Are PT devices the only MSI-capable devices available in a Xen guest? That's where I'm seeing this problem, with PT NVMe devices. I can say that on the Xen guest with NVMe PT devices I'm testing on, with the patch from this thread (which essentially reverts your commit above) as well as some added debug to see the pirq numbers, cycles of 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the hypervisor provides essentially the same pirqs for each modprobe, since they were freed by the rmmod. 
> > >> > tl;dr: >> > >> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >> > allocated, and reserved in the hypervisor >> > >> > request_irq() -> an event channel is opened for the specific pirq, and >> > maps the pirq with the hypervisor >> > >> > free_irq() -> the event channel is closed, and the pirq is unmapped, >> > but that unmap function also frees the pirq! The hypervisor can/will >> > give it away to the next call to get_free_pirq. However, the pci >> > msi/msix data area still contains the pirq number, and the next call >> > to request_irq() will re-use the pirq. >> > >> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >> > (it's already been freed), and it also doesn't clear anything from the >> > msi/msix data area, so the pirq is still cached there. >> > >> > >> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >> > when the event channel is closed - or at least, not to change it to >> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >> > call into something in the Xen guest kernel that actually does the >> > pirq unmapping, and clear it from the msi data area (i.e. >> > pci_write_msi_msg) >> >> The problem is that Xen changes have sailed a long long time ago. >> > >> > Alternately, if the hypervisor doesn't change, then the call into the >> > hypervisor to actually allocate a pirq needs to move from the >> > pci_enable_msix_range() call to the request_irq() call? So that when >> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >> > for any of the vectors; each request_irq/free_irq pair do the pirq >> > allocate-and-map/unmap... >> >> >> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >> from actually calling evtchn_close() until the pci_disable_msix() is done? > > I think that's a reasonable alternative: we mask the evtchn, but do not > call xen_evtchn_close in shutdown_pirq for PV on HVM guests. > Something like (not compiled, untested): > > diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c > index 137bd0e..3174923 100644 > --- a/drivers/xen/events/events_base.c > +++ b/drivers/xen/events/events_base.c > @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) > return; > > mask_evtchn(evtchn); > - xen_evtchn_close(evtchn); > + if (!xen_hvm_domain()) > + xen_evtchn_close(evtchn); wouldn't we need to also add code to the pci_disable_msix path that would actually close the evtchn? Would this leave the evtchn around forever? > xen_irq_info_cleanup(info); > } > > > We want to keep the pirq allocated to the device - not closing the > evtchn seems like the right thing to do. I suggest we test for the > original bug too: enable/disable the network interface of an MSI capable > network card. I don't have a Xen hypervisor setup myself, I'm just using AWS, but I'll try to test this on an instance with a SRIOV/PT nic. > > >> > longer details: >> > >> > The chain of function calls starts in the initial call to configure >> > the msi vectors, which eventually calls __pci_enable_msix_range (or >> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which >> > either tries to re-use any cached pirq in the MSI data area, or (for >> > the first time setup) allocates a new pirq from the hypervisor via >> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the >> > hypervisor's perspective, but it's not mapped to anything in the guest >> > kernel. 
>> > >> > Then, the driver calls request_irq to actually start using the irq, >> > which calls __setup_irq to irq_startup to startup_pirq. The >> > startup_pirq call actually creates the evtchn and binds it to the >> > previously allocated pirq via EVTCHNOP_bind_pirq. >> > >> > At this point, the pirq is bound to a guest kernel evtchn (and irq) >> > and is in use. But then, when the driver doesn't want it anymore, it >> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >> > function closes the evtchn via EVTCHNOP_close. >> > >> > Inside the hypervisor, in xen/common/event_channel.c in >> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >> > channel is) then it unmaps the pirq mapping via >> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >> > give away to anyone requesting a new pirq! >> > >> > >> > >> > >> > >> > > >> > > -boris >> > > >> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >> > >>> it finds the vector was previously configured with a pirq, and that pirq >> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >> > >>> from the hypervisor. However, that pirq was unmapped when the pci device >> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >> > >>> to a different pci device. >> > >> Hm, but this implies that we do keep track of it. >> > >> >> > >> >> > >> while (true) >> > >> do >> > >> rmmod nvme >> > >> modprobe nvme >> > >> done >> > >> >> > >> Things go boom without this patch. But with this patch does this >> > >> still work? As in we don't run out of PIRQs? >> > >> >> > >> Thanks. >> > >>> This exact situation is happening in a Xen guest where multiple NVMe >> > >>> controllers (pci devices) are present. The NVMe driver configures each >> > >>> pci device's msi/msix twice; first to configure a single vector (to >> > >>> talk to the controller for its configuration info), and then it disables >> > >>> that msi/msix and re-configures with all the msi/msix it needs. When >> > >>> multiple NVMe controllers are present, this happens concurrently on all >> > >>> of them, and in the time between controller A calling pci_disable_msix() >> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >> > >>> and gets controller A's pirq allocated from the hypervisor. Then when >> > >>> controller A re-configures its msix, its first vector tries to re-use >> > >>> the same pirq that it had before; but that pirq was allocated to >> > >>> controller B, and thus the Xen event channel for controller A's re-used >> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >> > >>> pirq mapped elsewhere. 
>> > >>> >> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> > >>> --- >> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >> > >>> >> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >> > >>> index bedfab9..a00a6c0 100644 >> > >>> --- a/arch/x86/pci/xen.c >> > >>> +++ b/arch/x86/pci/xen.c >> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >> > >>> return 1; >> > >>> >> > >>> for_each_pci_msi_entry(msidesc, dev) { >> > >>> - __pci_read_msi_msg(msidesc, &msg); >> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >> > >>> - xen_irq_from_pirq(pirq) < 0) { >> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >> > >>> - if (pirq < 0) { >> > >>> - irq = -ENODEV; >> > >>> - goto error; >> > >>> - } >> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >> > >>> - __pci_write_msi_msg(msidesc, &msg); >> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> > >>> - } else { >> > >>> - dev_dbg(&dev->dev, >> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >> > >>> + if (pirq < 0) { >> > >>> + irq = -ENODEV; >> > >>> + goto error; >> > >>> } >> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >> > >>> + __pci_write_msi_msg(msidesc, &msg); >> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >> > >>> (type == PCI_CAP_ID_MSIX) ? >> > >>> -- >> > >>> 2.9.3 >> > >>> >> > > >> > > >> > > _______________________________________________ >> > > Xen-devel mailing list >> > > Xen-devel@lists.xen.org >> > > https://lists.xen.org/xen-devel >>
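As an aside, the allocate/map/unmap asymmetry and the resulting reuse race described in the tl;dr above can be sketched as a small standalone C program. This is only a toy model under invented names (the pool size, struct, and helper functions merely stand in for PHYSDEVOP_get_free_pirq, EVTCHNOP_bind_pirq/EVTCHNOP_close, and the pirq cached in the MSI data area); it is not kernel or hypervisor code:

#include <stdio.h>
#include <stdbool.h>

#define NR_PIRQS 4

static bool pirq_in_use[NR_PIRQS];    /* "hypervisor" side: pirq bound to an evtchn */

/* stand-in for PHYSDEVOP_get_free_pirq (pool exhaustion not handled in this toy) */
static int get_free_pirq(void)
{
    for (int i = 0; i < NR_PIRQS; i++)
        if (!pirq_in_use[i])
            return i;
    return -1;
}

struct pci_dev_model {
    const char *name;
    int cached_pirq;                  /* stand-in for the pirq cached in the MSI data area */
};

/* stand-in for the old xen_hvm_setup_msi_irqs() + request_irq() path */
static void enable_and_request(struct pci_dev_model *d)
{
    if (d->cached_pirq < 0)           /* no cached pirq: ask the hypervisor for one */
        d->cached_pirq = get_free_pirq();

    if (pirq_in_use[d->cached_pirq])  /* EVTCHNOP_bind_pirq on a pirq owned elsewhere */
        printf("%s: bind to pirq %d fails, already mapped elsewhere\n",
               d->name, d->cached_pirq);
    else {
        pirq_in_use[d->cached_pirq] = true;
        printf("%s: bound to pirq %d\n", d->name, d->cached_pirq);
    }
}

/* stand-in for free_irq(): closing the evtchn frees the pirq in the
 * hypervisor, but the guest-side cache is never cleared */
static void free_and_disable(struct pci_dev_model *d)
{
    pirq_in_use[d->cached_pirq] = false;
}

int main(void)
{
    struct pci_dev_model a = { "ctrl A", -1 }, b = { "ctrl B", -1 };

    enable_and_request(&a);           /* A gets pirq 0 and binds it */
    free_and_disable(&a);             /* pirq 0 freed, but A still caches it */
    enable_and_request(&b);           /* B asks for a free pirq and gets 0 */
    enable_and_request(&a);           /* A re-uses the stale cached pirq 0 -> fails */
    return 0;
}

Running it prints a successful bind for controller A, then for controller B on the same pirq, and finally the failed re-bind for controller A - the same sequence as the concurrent-NVMe failure described in the patch changelog.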
On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>> > <boris.ostrovsky@oracle.com> wrote: >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>> > >>> Do not read a pci device's msi message data to see if a pirq was >>> > >>> previously configured for the device's msi/msix, as the old pirq was >>> > >>> unmapped and may now be in use by another pci device. The previous >>> > >>> pirq should never be re-used; instead a new pirq should always be >>> > >>> allocated from the hypervisor. >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >>> > >> as we are not reusing them? >>> > > >>> > > Don't we free the pirq when we unmap it? >>> > >>> > I think this is actually a bit worse than I initially thought. After >>> > looking a bit closer, and I think there's an asymmetry with pirq >>> > allocation: >>> >>> Lets include Stefano, >>> >>> Thank you for digging in this! This has quite the deja-vu >>> feeling as I believe I hit this at some point in the past and >>> posted some possible ways of fixing this. But sadly I can't >>> find the thread. >> >> This issue seems to be caused by: >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> Date: Wed Dec 1 14:51:44 2010 +0000 >> >> xen: fix MSI setup and teardown for PV on HVM guests >> >> which was a fix to a bug: >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >> trying to enable the same MSI for the second time: the old MSI to pirq >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >> try to assign a new pirq anyway. >> A simple way to reproduce this bug is to assign an MSI capable network >> card to a PV on HVM guest, if the user brings down the corresponding >> ethernet interface and up again, Linux would fail to enable MSIs on the >> device. >> >> I don't remember any of the details. From the description of this bug, >> it seems that Xen changed behavior in the past few years: before it used >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >> have been completely freed, then reassigned to somebody else the way it >> is described in this email. >> >> I think we should indentify the changeset or Xen version that introduced >> the new behavior. If it is old enough, we might be able to just revert >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >> behavior conditional to the Xen version. > > Are PT devices the only MSI-capable devices available in a Xen guest? > That's where I'm seeing this problem, with PT NVMe devices. > > I can say that on the Xen guest with NVMe PT devices I'm testing on, > with the patch from this thread (which essentially reverts your commit > above) as well as some added debug to see the pirq numbers, cycles of > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > hypervisor provides essentially the same pirqs for each modprobe, > since they were freed by the rmmod. 
> >> >> >>> > tl;dr: >>> > >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >>> > allocated, and reserved in the hypervisor >>> > >>> > request_irq() -> an event channel is opened for the specific pirq, and >>> > maps the pirq with the hypervisor >>> > >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, >>> > but that unmap function also frees the pirq! The hypervisor can/will >>> > give it away to the next call to get_free_pirq. However, the pci >>> > msi/msix data area still contains the pirq number, and the next call >>> > to request_irq() will re-use the pirq. >>> > >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >>> > (it's already been freed), and it also doesn't clear anything from the >>> > msi/msix data area, so the pirq is still cached there. >>> > >>> > >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >>> > when the event channel is closed - or at least, not to change it to >>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >>> > call into something in the Xen guest kernel that actually does the >>> > pirq unmapping, and clear it from the msi data area (i.e. >>> > pci_write_msi_msg) >>> >>> The problem is that Xen changes have sailed a long long time ago. >>> > >>> > Alternately, if the hypervisor doesn't change, then the call into the >>> > hypervisor to actually allocate a pirq needs to move from the >>> > pci_enable_msix_range() call to the request_irq() call? So that when >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >>> > for any of the vectors; each request_irq/free_irq pair do the pirq >>> > allocate-and-map/unmap... >>> >>> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >>> from actually calling evtchn_close() until the pci_disable_msix() is done? >> >> I think that's a reasonable alternative: we mask the evtchn, but do not >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests. >> Something like (not compiled, untested): >> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c >> index 137bd0e..3174923 100644 >> --- a/drivers/xen/events/events_base.c >> +++ b/drivers/xen/events/events_base.c >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) >> return; >> >> mask_evtchn(evtchn); >> - xen_evtchn_close(evtchn); >> + if (!xen_hvm_domain()) >> + xen_evtchn_close(evtchn); > > wouldn't we need to also add code to the pci_disable_msix path that > would actually close the evtchn? Would this leave the evtchn around > forever? > >> xen_irq_info_cleanup(info); >> } >> >> >> We want to keep the pirq allocated to the device - not closing the >> evtchn seems like the right thing to do. I suggest we test for the >> original bug too: enable/disable the network interface of an MSI capable >> network card. > > I don't have a Xen hypervisor setup myself, I'm just using AWS, but > I'll try to test this on an instance with a SRIOV/PT nic. I started an instance with SRIOV nics, with the latest upstream kernel plus this patch and debug to show the pirqs. Taking an interface down and back up uses the same pirq, because the driver only does free_irq/request_irq, which just closes/reopens the evtchn using the same pirq (which is cached in the struct irq_info) - it doesn't disable/reenable the MSIX vectors. 
Doing a complete rmmod/modprobe of the driver does disable and reenable the MSIX vectors, but the hypervisor provides the same pirqs from the get_free_pirq() call that were used before (since nothing else is asking for free pirqs). > >> >> >>> > longer details: >>> > >>> > The chain of function calls starts in the initial call to configure >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which >>> > either tries to re-use any cached pirq in the MSI data area, or (for >>> > the first time setup) allocates a new pirq from the hypervisor via >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the >>> > hypervisor's perspective, but it's not mapped to anything in the guest >>> > kernel. >>> > >>> > Then, the driver calls request_irq to actually start using the irq, >>> > which calls __setup_irq to irq_startup to startup_pirq. The >>> > startup_pirq call actually creates the evtchn and binds it to the >>> > previously allocated pirq via EVTCHNOP_bind_pirq. >>> > >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) >>> > and is in use. But then, when the driver doesn't want it anymore, it >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >>> > function closes the evtchn via EVTCHNOP_close. >>> > >>> > Inside the hypervisor, in xen/common/event_channel.c in >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >>> > channel is) then it unmaps the pirq mapping via >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >>> > give away to anyone requesting a new pirq! >>> > >>> > >>> > >>> > >>> > >>> > > >>> > > -boris >>> > > >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >>> > >>> it finds the vector was previously configured with a pirq, and that pirq >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >>> > >>> to a different pci device. >>> > >> Hm, but this implies that we do keep track of it. >>> > >> >>> > >> >>> > >> while (true) >>> > >> do >>> > >> rmmod nvme >>> > >> modprobe nvme >>> > >> done >>> > >> >>> > >> Things go boom without this patch. But with this patch does this >>> > >> still work? As in we don't run out of PIRQs? >>> > >> >>> > >> Thanks. >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe >>> > >>> controllers (pci devices) are present. The NVMe driver configures each >>> > >>> pci device's msi/msix twice; first to configure a single vector (to >>> > >>> talk to the controller for its configuration info), and then it disables >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. When >>> > >>> multiple NVMe controllers are present, this happens concurrently on all >>> > >>> of them, and in the time between controller A calling pci_disable_msix() >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >>> > >>> and gets controller A's pirq allocated from the hypervisor. 
Then when >>> > >>> controller A re-configures its msix, its first vector tries to re-use >>> > >>> the same pirq that it had before; but that pirq was allocated to >>> > >>> controller B, and thus the Xen event channel for controller A's re-used >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >>> > >>> pirq mapped elsewhere. >>> > >>> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >>> > >>> --- >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >>> > >>> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >>> > >>> index bedfab9..a00a6c0 100644 >>> > >>> --- a/arch/x86/pci/xen.c >>> > >>> +++ b/arch/x86/pci/xen.c >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >>> > >>> return 1; >>> > >>> >>> > >>> for_each_pci_msi_entry(msidesc, dev) { >>> > >>> - __pci_read_msi_msg(msidesc, &msg); >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >>> > >>> - xen_irq_from_pirq(pirq) < 0) { >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >>> > >>> - if (pirq < 0) { >>> > >>> - irq = -ENODEV; >>> > >>> - goto error; >>> > >>> - } >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >>> > >>> - __pci_write_msi_msg(msidesc, &msg); >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> > >>> - } else { >>> > >>> - dev_dbg(&dev->dev, >>> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >>> > >>> + if (pirq < 0) { >>> > >>> + irq = -ENODEV; >>> > >>> + goto error; >>> > >>> } >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >>> > >>> + __pci_write_msi_msg(msidesc, &msg); >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >>> > >>> (type == PCI_CAP_ID_MSIX) ? >>> > >>> -- >>> > >>> 2.9.3 >>> > >>> >>> > > >>> > > >>> > > _______________________________________________ >>> > > Xen-devel mailing list >>> > > Xen-devel@lists.xen.org >>> > > https://lists.xen.org/xen-devel >>>
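For reference, with the patch under discussion applied, the per-vector loop in xen_hvm_setup_msi_irqs() reduces to roughly the following. This is reconstructed from the diff quoted earlier in the thread, not standalone code; the final arguments of the xen_bind_pirq_msi_to_irq() call are truncated in the quoted diff and are left elided here as well:

for_each_pci_msi_entry(msidesc, dev) {
    /* always ask the hypervisor for a fresh pirq; never trust the
     * pirq cached in the MSI message data */
    pirq = xen_allocate_pirq_msi(dev, msidesc);
    if (pirq < 0) {
        irq = -ENODEV;
        goto error;
    }
    xen_msi_compose_msg(dev, pirq, &msg);
    __pci_write_msi_msg(msidesc, &msg);
    dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
    irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
                                   (type == PCI_CAP_ID_MSI) ? nvec : 1,
                                   (type == PCI_CAP_ID_MSIX) ?
                                   /* remainder truncated in the quoted diff */ ...);
    ...
}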
On Tue, 10 Jan 2017, Dan Streetman wrote: > On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: > > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini > > <sstabellini@kernel.org> wrote: > >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: > >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > >>> > <boris.ostrovsky@oracle.com> wrote: > >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >>> > >>> Do not read a pci device's msi message data to see if a pirq was > >>> > >>> previously configured for the device's msi/msix, as the old pirq was > >>> > >>> unmapped and may now be in use by another pci device. The previous > >>> > >>> pirq should never be re-used; instead a new pirq should always be > >>> > >>> allocated from the hypervisor. > >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >>> > >> as we are not reusing them? > >>> > > > >>> > > Don't we free the pirq when we unmap it? > >>> > > >>> > I think this is actually a bit worse than I initially thought. After > >>> > looking a bit closer, and I think there's an asymmetry with pirq > >>> > allocation: > >>> > >>> Let's include Stefano, > >>> > >>> Thank you for digging in this! This has quite the deja-vu > >>> feeling as I believe I hit this at some point in the past and > >>> posted some possible ways of fixing this. But sadly I can't > >>> find the thread. > >> > >> This issue seems to be caused by: > >> > >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f > >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > >> Date: Wed Dec 1 14:51:44 2010 +0000 > >> > >> xen: fix MSI setup and teardown for PV on HVM guests > >> > >> which was a fix to a bug: > >> > >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when > >> trying to enable the same MSI for the second time: the old MSI to pirq > >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would > >> try to assign a new pirq anyway. > >> A simple way to reproduce this bug is to assign an MSI capable network > >> card to a PV on HVM guest, if the user brings down the corresponding > >> ethernet interface and up again, Linux would fail to enable MSIs on the > >> device. > >> > >> I don't remember any of the details. From the description of this bug, > >> it seems that Xen changed behavior in the past few years: before it used > >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the > >> old MSI to pirq mapping is still valid at this point", the pirq couldn't > >> have been completely freed, then reassigned to somebody else the way it > >> is described in this email. > >> > >> I think we should identify the changeset or Xen version that introduced > >> the new behavior. If it is old enough, we might be able to just revert > >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the > >> behavior conditional to the Xen version. > > > > Are PT devices the only MSI-capable devices available in a Xen guest? > > That's where I'm seeing this problem, with PT NVMe devices. They are the main ones. It is possible to have emulated virtio devices with emulated MSIs, for example virtio-net. They are not in the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken too. 
> > I can say that on the Xen guest with NVMe PT devices I'm testing on, > > with the patch from this thread (which essentially reverts your commit > > above) as well as some added debug to see the pirq numbers, cycles of > > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the > > hypervisor provides essentially the same pirqs for each modprobe, > > since they were freed by the rmmod.

I am fine with reverting the old patch, but we need to understand what caused the change in behavior first. Maybe somebody else with a Xen PCI passthrough setup at hand can help with that.

In the Xen code I can still see:

    case ECS_PIRQ: {
        struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);

        if ( !pirq )
            break;
        if ( !is_hvm_domain(d1) )
            pirq_guest_unbind(d1, pirq);

which means that pirq_guest_unbind should only be called on evtchn_close if the guest is not an HVM guest.

> >>> > tl;dr: > >>> > > >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq > >>> > allocated, and reserved in the hypervisor > >>> > > >>> > request_irq() -> an event channel is opened for the specific pirq, and > >>> > maps the pirq with the hypervisor > >>> > > >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, > >>> > but that unmap function also frees the pirq! The hypervisor can/will > >>> > give it away to the next call to get_free_pirq. However, the pci > >>> > msi/msix data area still contains the pirq number, and the next call > >>> > to request_irq() will re-use the pirq. > >>> > > >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor > >>> > (it's already been freed), and it also doesn't clear anything from the > >>> > msi/msix data area, so the pirq is still cached there. > >>> > > >>> > > >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq > >>> > when the event channel is closed - or at least, not to change it to > >>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually > >>> > call into something in the Xen guest kernel that actually does the > >>> > pirq unmapping, and clear it from the msi data area (i.e. > >>> > pci_write_msi_msg) > >>> > >>> The problem is that Xen changes have sailed a long long time ago. > >>> > > >>> > Alternately, if the hypervisor doesn't change, then the call into the > >>> > hypervisor to actually allocate a pirq needs to move from the > >>> > pci_enable_msix_range() call to the request_irq() call? So that when > >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's > >>> > for any of the vectors; each request_irq/free_irq pair do the pirq > >>> > allocate-and-map/unmap... > >>> > >>> > >>> Or a third one: We keep a pirq->device lookup and inhibit free_irq() > >>> from actually calling evtchn_close() until the pci_disable_msix() is done? > >> > >> I think that's a reasonable alternative: we mask the evtchn, but do not > >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests. > >> Something like (not compiled, untested): > >> > >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c > >> index 137bd0e..3174923 100644 > >> --- a/drivers/xen/events/events_base.c > >> +++ b/drivers/xen/events/events_base.c > >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) > >> return; > >> > >> mask_evtchn(evtchn); > >> - xen_evtchn_close(evtchn); > >> + if (!xen_hvm_domain()) > >> + xen_evtchn_close(evtchn); > > > > wouldn't we need to also add code to the pci_disable_msix path that > > would actually close the evtchn? 
Would this leave the evtchn around > > forever? > > > >> xen_irq_info_cleanup(info); > >> } > >> > >> > >> We want to keep the pirq allocated to the device - not closing the > >> evtchn seems like the right thing to do. I suggest we test for the > >> original bug too: enable/disable the network interface of an MSI capable > >> network card. > > > > I don't have a Xen hypervisor setup myself, I'm just using AWS, but > > I'll try to test this on an instance with a SRIOV/PT nic. > > I started an instance with SRIOV nics, with the latest upstream kernel > plus this patch and debug to show the pirqs. Taking an interface down > and back up uses the same pirq, because the driver only does > free_irq/request_irq, which just closes/reopens the evtchn using the > same pirq (which is cached in the struct irq_info) - it doesn't > disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of > the driver does disable and reenable the MSIX vectors, but the > hypervisor provides the same pirqs from the get_free_pirq() call that > were used before (since nothing else is asking for free pirqs). Good! Thanks for testing, it's very helpful. I believe it should work even if the hypervisor returned a different pirq though. > >>> > longer details: > >>> > > >>> > The chain of function calls starts in the initial call to configure > >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or > >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which > >>> > either tries to re-use any cached pirq in the MSI data area, or (for > >>> > the first time setup) allocates a new pirq from the hypervisor via > >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the > >>> > hypervisor's perspective, but it's not mapped to anything in the guest > >>> > kernel. > >>> > > >>> > Then, the driver calls request_irq to actually start using the irq, > >>> > which calls __setup_irq to irq_startup to startup_pirq. The > >>> > startup_pirq call actually creates the evtchn and binds it to the > >>> > previously allocated pirq via EVTCHNOP_bind_pirq. > >>> > > >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) > >>> > and is in use. But then, when the driver doesn't want it anymore, it > >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that > >>> > function closes the evtchn via EVTCHNOP_close. > >>> > > >>> > Inside the hypervisor, in xen/common/event_channel.c in > >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq > >>> > channel is) then it unmaps the pirq mapping via > >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back > >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to > >>> > give away to anyone requesting a new pirq! > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > > >>> > > -boris > >>> > > > >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's > >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if > >>> > >>> it finds the vector was previously configured with a pirq, and that pirq > >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq > >>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device > >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given > >>> > >>> to a different pci device. > >>> > >> Hm, but this implies that we do keep track of it. 
> >>> > >> > >>> > >> > >>> > >> while (true) > >>> > >> do > >>> > >> rmmod nvme > >>> > >> modprobe nvme > >>> > >> done > >>> > >> > >>> > >> Things go boom without this patch. But with this patch does this > >>> > >> still work? As in we don't run out of PIRQs? > >>> > >> > >>> > >> Thanks. > >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe > >>> > >>> controllers (pci devices) are present. The NVMe driver configures each > >>> > >>> pci device's msi/msix twice; first to configure a single vector (to > >>> > >>> talk to the controller for its configuration info), and then it disables > >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. When > >>> > >>> multiple NVMe controllers are present, this happens concurrently on all > >>> > >>> of them, and in the time between controller A calling pci_disable_msix() > >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix > >>> > >>> and gets controller A's pirq allocated from the hypervisor. Then when > >>> > >>> controller A re-configures its msix, its first vector tries to re-use > >>> > >>> the same pirq that it had before; but that pirq was allocated to > >>> > >>> controller B, and thus the Xen event channel for controller A's re-used > >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the > >>> > >>> pirq mapped elsewhere. > >>> > >>> > >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> > >>> > >>> --- > >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- > >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) > >>> > >>> > >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > >>> > >>> index bedfab9..a00a6c0 100644 > >>> > >>> --- a/arch/x86/pci/xen.c > >>> > >>> +++ b/arch/x86/pci/xen.c > >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) > >>> > >>> return 1; > >>> > >>> > >>> > >>> for_each_pci_msi_entry(msidesc, dev) { > >>> > >>> - __pci_read_msi_msg(msidesc, &msg); > >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || > >>> > >>> - xen_irq_from_pirq(pirq) < 0) { > >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> > >>> - if (pirq < 0) { > >>> > >>> - irq = -ENODEV; > >>> > >>> - goto error; > >>> > >>> - } > >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); > >>> > >>> - __pci_write_msi_msg(msidesc, &msg); > >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> > >>> - } else { > >>> > >>> - dev_dbg(&dev->dev, > >>> > >>> - "xen: msi already bound to pirq=%d\n", pirq); > >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> > >>> + if (pirq < 0) { > >>> > >>> + irq = -ENODEV; > >>> > >>> + goto error; > >>> > >>> } > >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); > >>> > >>> + __pci_write_msi_msg(msidesc, &msg); > >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, > >>> > >>> (type == PCI_CAP_ID_MSIX) ? > >>> > >>> -- > >>> > >>> 2.9.3 > >>> > >>> > >>> > > > >>> > > > >>> > > _______________________________________________ > >>> > > Xen-devel mailing list > >>> > > Xen-devel@lists.xen.org > >>> > > https://lists.xen.org/xen-devel > >>> >
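For clarity, the tail of shutdown_pirq() with Stefano's proposed (uncompiled, untested) hunk applied would read as follows; only the lines visible in the quoted diff are reproduced. As Dan asks above, something on the pci_disable_msix()/xen_teardown_msi_irq() path would presumably still have to close the deferred event channel later, or it would be leaked:

    mask_evtchn(evtchn);
    /* PV on HVM: keep the event channel, and with it the pirq binding,
     * so the hypervisor cannot hand the pirq to another device;
     * PV guests keep the old behavior and close here */
    if (!xen_hvm_domain())
        xen_evtchn_close(evtchn);
    xen_irq_info_cleanup(info);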
On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini <sstabellini@kernel.org> wrote: > On Tue, 10 Jan 2017, Dan Streetman wrote: >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >> > <sstabellini@kernel.org> wrote: >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >> >>> > <boris.ostrovsky@oracle.com> wrote: >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >> >>> > >>> unmapped and may now be in use by another pci device. The previous >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >> >>> > >>> allocated from the hypervisor. >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >> >>> > >> as we are not reusing them? >> >>> > > >> >>> > > Don't we free the pirq when we unmap it? >> >>> > >> >>> > I think this is actually a bit worse than I initially thought. After >> >>> > looking a bit closer, and I think there's an asymmetry with pirq >> >>> > allocation: >> >>> >> >>> Lets include Stefano, >> >>> >> >>> Thank you for digging in this! This has quite the deja-vu >> >>> feeling as I believe I hit this at some point in the past and >> >>> posted some possible ways of fixing this. But sadly I can't >> >>> find the thread. >> >> >> >> This issue seems to be caused by: >> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 >> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests >> >> >> >> which was a fix to a bug: >> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >> >> trying to enable the same MSI for the second time: the old MSI to pirq >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >> >> try to assign a new pirq anyway. >> >> A simple way to reproduce this bug is to assign an MSI capable network >> >> card to a PV on HVM guest, if the user brings down the corresponding >> >> ethernet interface and up again, Linux would fail to enable MSIs on the >> >> device. >> >> >> >> I don't remember any of the details. From the description of this bug, >> >> it seems that Xen changed behavior in the past few years: before it used >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >> >> have been completely freed, then reassigned to somebody else the way it >> >> is described in this email. >> >> >> >> I think we should indentify the changeset or Xen version that introduced >> >> the new behavior. If it is old enough, we might be able to just revert >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >> >> behavior conditional to the Xen version. >> > >> > Are PT devices the only MSI-capable devices available in a Xen guest? >> > That's where I'm seeing this problem, with PT NVMe devices. > > They are the main ones. It is possible to have emulated virtio devices > with emulated MSIs, for example virtio-net. 
They are not in > the Xen Project CI-loop, though, so I wouldn't be surprised if they are broken > too. > > >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, >> > with the patch from this thread (which essentially reverts your commit >> > above) as well as some added debug to see the pirq numbers, cycles of >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >> > hypervisor provides essentially the same pirqs for each modprobe, >> > since they were freed by the rmmod. > > I am fine with reverting the old patch, but we need to understand what > caused the change in behavior first. Maybe somebody else with a Xen PCI > passthrough setup at hand can help with that. > > In the Xen code I can still see: > > case ECS_PIRQ: { > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); > > if ( !pirq ) > break; > if ( !is_hvm_domain(d1) ) > pirq_guest_unbind(d1, pirq); > > which means that pirq_guest_unbind should only be called on evtchn_close > if the guest is not an HVM guest.

I tried an experiment to call get_free_pirq on both sides of an evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I see something interesting; each nic uses 3 MSIs, and it looks like when they shut down, each nic's 3 pirqs are not available until after the nic is done shutting down, so it seems like closing the evtchn isn't what makes the pirq free.

[3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
[3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
[3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
[3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
[3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
[3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
[3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
[3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps

nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.

[3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == 0, xen_pvh_domain() == 0

just to be sure, a bit of dbg so I know what domain this is :-)

[3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
[3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
[3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
[3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
[3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
[3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
[3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
[3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
[3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88

nic 4 is shut down first, and closes its evtchns for pirqs 96-98; but none of those become available for get_free_pirq.

[3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98

aha, now nic 4 has fully finished shutting down, and nic 3 has started shutdown; we see those pirqs from nic 4 are now available. So it must not be evtchn closing that frees the pirqs. 
[3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
[3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
[3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
[3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
[3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
[3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
[3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
[3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85

so, since pci_disable_msix eventually calls xen_teardown_msi_irq() which calls xen_destroy_irq(), I moved the dbg to xen_destroy_irq() (and recompiled/rebooted) and found the pirqs have already been freed before that is called:

[3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
[3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
[3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
[3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
[3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
[3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
[3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
[3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
[3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
[3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
[3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
[3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95

I'm still following through the kernel code, but it's not immediately obvious what exactly is telling the hypervisor to free the pirqs; any idea?

From the hypervisor code, it seems that the pirq is "available" via is_free_pirq():

    return !pirq || (!pirq->arch.irq &&
        (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND));

when the evtchn is closed, it does:

    if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
        unmap_domain_pirq_emuirq(d1, pirq->pirq);

and that call to unmap_domain_pirq_emuirq does:

    info->arch.hvm.emuirq = IRQ_UNBOUND;

so, the only thing left is to clear pirq->arch.irq, but the only place I can find that does that is clear_domain_irq_pirq(), which is only called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not seeing where either of those would be called when all the kernel is doing is disabling a pci device.

> > >> >>> > tl;dr: >> >>> > >> >>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq >> >>> > allocated, and reserved in the hypervisor >> >>> > >> >>> > request_irq() -> an event channel is opened for the specific pirq, and >> >>> > maps the pirq with the hypervisor >> >>> > >> >>> > free_irq() -> the event channel is closed, and the pirq is unmapped, >> >>> > but that unmap function also frees the pirq! The hypervisor can/will >> >>> > give it away to the next call to get_free_pirq. However, the pci >> >>> > msi/msix data area still contains the pirq number, and the next call >> >>> > to request_irq() will re-use the pirq. >> >>> > >> >>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor >> >>> > (it's already been freed), and it also doesn't clear anything from the >> >>> > msi/msix data area, so the pirq is still cached there. 
>> >>> > >> >>> > >> >>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq >> >>> > when the event channel is closed - or at least, not to change it to >> >>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually >> >>> > call into something in the Xen guest kernel that actually does the >> >>> > pirq unmapping, and clear it from the msi data area (i.e. >> >>> > pci_write_msi_msg) >> >>> >> >>> The problem is that Xen changes have sailed a long long time ago. >> >>> > >> >>> > Alternately, if the hypervisor doesn't change, then the call into the >> >>> > hypervisor to actually allocate a pirq needs to move from the >> >>> > pci_enable_msix_range() call to the request_irq() call? So that when >> >>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's >> >>> > for any of the vectors; each request_irq/free_irq pair do the pirq >> >>> > allocate-and-map/unmap... >> >>> >> >>> >> >>> Or a third one: We keep an pirq->device lookup and inhibit free_irq() >> >>> from actually calling evtchn_close() until the pci_disable_msix() is done? >> >> >> >> I think that's a reasonable alternative: we mask the evtchn, but do not >> >> call xen_evtchn_close in shutdown_pirq for PV on HVM guests. >> >> Something like (not compiled, untested): >> >> >> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c >> >> index 137bd0e..3174923 100644 >> >> --- a/drivers/xen/events/events_base.c >> >> +++ b/drivers/xen/events/events_base.c >> >> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data) >> >> return; >> >> >> >> mask_evtchn(evtchn); >> >> - xen_evtchn_close(evtchn); >> >> + if (!xen_hvm_domain()) >> >> + xen_evtchn_close(evtchn); >> > >> > wouldn't we need to also add code to the pci_disable_msix path that >> > would actually close the evtchn? Would this leave the evtchn around >> > forever? >> > >> >> xen_irq_info_cleanup(info); >> >> } >> >> >> >> >> >> We want to keep the pirq allocated to the device - not closing the >> >> evtchn seems like the right thing to do. I suggest we test for the >> >> original bug too: enable/disable the network interface of an MSI capable >> >> network card. >> > >> > I don't have a Xen hypervisor setup myself, I'm just using AWS, but >> > I'll try to test this on an instance with a SRIOV/PT nic. >> >> I started an instance with SRIOV nics, with the latest upstream kernel >> plus this patch and debug to show the pirqs. Taking an interface down >> and back up uses the same pirq, because the driver only does >> free_irq/request_irq, which just closes/reopens the evtchn using the >> same pirq (which is cached in the struct irq_info) - it doesn't >> disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of >> the driver does disable and reenable the MSIX vectors, but the >> hypervisor provides the same pirqs from the get_free_pirq() call that >> were used before (since nothing else is asking for free pirqs). > > Good! Thanks for testing, it's very helpful. I believe it should work > even if the hypervisor returned a different pirq though. 
> > >> >>> > longer details: >> >>> > >> >>> > The chain of function calls starts in the initial call to configure >> >>> > the msi vectors, which eventually calls __pci_enable_msix_range (or >> >>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which >> >>> > either tries to re-use any cached pirq in the MSI data area, or (for >> >>> > the first time setup) allocates a new pirq from the hypervisor via >> >>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the >> >>> > hypervisor's perspective, but it's not mapped to anything in the guest >> >>> > kernel. >> >>> > >> >>> > Then, the driver calls request_irq to actually start using the irq, >> >>> > which calls __setup_irq to irq_startup to startup_pirq. The >> >>> > startup_pirq call actually creates the evtchn and binds it to the >> >>> > previously allocated pirq via EVTCHNOP_bind_pirq. >> >>> > >> >>> > At this point, the pirq is bound to a guest kernel evtchn (and irq) >> >>> > and is in use. But then, when the driver doesn't want it anymore, it >> >>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that >> >>> > function closes the evtchn via EVTCHNOP_close. >> >>> > >> >>> > Inside the hypervisor, in xen/common/event_channel.c in >> >>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq >> >>> > channel is) then it unmaps the pirq mapping via >> >>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back >> >>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to >> >>> > give away to anyone requesting a new pirq! >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > > >> >>> > > -boris >> >>> > > >> >>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's >> >>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if >> >>> > >>> it finds the vector was previously configured with a pirq, and that pirq >> >>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq >> >>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device >> >>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given >> >>> > >>> to a different pci device. >> >>> > >> Hm, but this implies that we do keep track of it. >> >>> > >> >> >>> > >> >> >>> > >> while (true) >> >>> > >> do >> >>> > >> rmmod nvme >> >>> > >> modprobe nvme >> >>> > >> done >> >>> > >> >> >>> > >> Things go boom without this patch. But with this patch does this >> >>> > >> still work? As in we don't run out of PIRQs? >> >>> > >> >> >>> > >> Thanks. >> >>> > >>> This exact situation is happening in a Xen guest where multiple NVMe >> >>> > >>> controllers (pci devices) are present. The NVMe driver configures each >> >>> > >>> pci device's msi/msix twice; first to configure a single vector (to >> >>> > >>> talk to the controller for its configuration info), and then it disables >> >>> > >>> that msi/msix and re-configures with all the msi/msix it needs. When >> >>> > >>> multiple NVMe controllers are present, this happens concurrently on all >> >>> > >>> of them, and in the time between controller A calling pci_disable_msix() >> >>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix >> >>> > >>> and gets controller A's pirq allocated from the hypervisor. 
Then when >> >>> > >>> controller A re-configures its msix, its first vector tries to re-use >> >>> > >>> the same pirq that it had before; but that pirq was allocated to >> >>> > >>> controller B, and thus the Xen event channel for controller A's re-used >> >>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the >> >>> > >>> pirq mapped elsewhere. >> >>> > >>> >> >>> > >>> Signed-off-by: Dan Streetman <dan.streetman@canonical.com> >> >>> > >>> --- >> >>> > >>> arch/x86/pci/xen.c | 23 +++++++---------------- >> >>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-) >> >>> > >>> >> >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c >> >>> > >>> index bedfab9..a00a6c0 100644 >> >>> > >>> --- a/arch/x86/pci/xen.c >> >>> > >>> +++ b/arch/x86/pci/xen.c >> >>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) >> >>> > >>> return 1; >> >>> > >>> >> >>> > >>> for_each_pci_msi_entry(msidesc, dev) { >> >>> > >>> - __pci_read_msi_msg(msidesc, &msg); >> >>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | >> >>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); >> >>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || >> >>> > >>> - xen_irq_from_pirq(pirq) < 0) { >> >>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); >> >>> > >>> - if (pirq < 0) { >> >>> > >>> - irq = -ENODEV; >> >>> > >>> - goto error; >> >>> > >>> - } >> >>> > >>> - xen_msi_compose_msg(dev, pirq, &msg); >> >>> > >>> - __pci_write_msi_msg(msidesc, &msg); >> >>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> >>> > >>> - } else { >> >>> > >>> - dev_dbg(&dev->dev, >> >>> > >>> - "xen: msi already bound to pirq=%d\n", pirq); >> >>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); >> >>> > >>> + if (pirq < 0) { >> >>> > >>> + irq = -ENODEV; >> >>> > >>> + goto error; >> >>> > >>> } >> >>> > >>> + xen_msi_compose_msg(dev, pirq, &msg); >> >>> > >>> + __pci_write_msi_msg(msidesc, &msg); >> >>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); >> >>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, >> >>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1, >> >>> > >>> (type == PCI_CAP_ID_MSIX) ? >> >>> > >>> -- >> >>> > >>> 2.9.3 >> >>> > >>> >> >>> > > >> >>> > > >> >>> > > _______________________________________________ >> >>> > > Xen-devel mailing list >> >>> > > Xen-devel@lists.xen.org >> >>> > > https://lists.xen.org/xen-devel >> >>> >>
>> In the Xen code I can still see:
>>
>>     case ECS_PIRQ: {
>>         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
>>
>>         if ( !pirq )
>>             break;
>>         if ( !is_hvm_domain(d1) )
>>             pirq_guest_unbind(d1, pirq);
>>
>> which means that pirq_guest_unbind should only be called on evtchn_close
>> if the guest is not an HVM guest.

A few lines further down we have:

    pirq->evtchn = 0;
    pirq_cleanup_check(pirq, d1);
    unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
#ifdef CONFIG_X86
    if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
        unmap_domain_pirq_emuirq(d1, pirq->pirq);
#endif

which is where we free up the pirq.

-boris
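Putting Stefano's and Boris's two fragments together, the ECS_PIRQ arm of evtchn_close() in xen/common/event_channel.c reads roughly as follows. This is a reconstruction from the excerpts quoted in this thread with connecting comments, not verbatim Xen source; code between the two excerpts is elided:

    case ECS_PIRQ: {
        struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);

        if ( !pirq )
            break;
        /* PV guests get a full unbind here... */
        if ( !is_hvm_domain(d1) )
            pirq_guest_unbind(d1, pirq);
        /* ... elided ... */
        pirq->evtchn = 0;
        pirq_cleanup_check(pirq, d1);
        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
#ifdef CONFIG_X86
        /* ...while HVM guests lose only the emuirq mapping; that sets
         * the pirq to IRQ_UNBOUND, which is half of what is_free_pirq()
         * needs before it hands the pirq out again */
        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
            unmap_domain_pirq_emuirq(d1, pirq->pirq);
#endif
        break;
    }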
On Tue, 10 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote:
> >> > [...]
> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
> >> > That's where I'm seeing this problem, with PT NVMe devices.
> >
> > They are the main ones. It is possible to have emulated virtio devices
> > with emulated MSIs, for example virtio-net. Although they are not in
> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> > too.
> >
> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> >> > with the patch from this thread (which essentially reverts your commit
> >> > above) as well as some added debug to see the pirq numbers, cycles of
> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> >> > hypervisor provides essentially the same pirqs for each modprobe,
> >> > since they were freed by the rmmod.
> >
> > I am fine with reverting the old patch, but we need to understand what
> > caused the change in behavior first. Maybe somebody else with a Xen PCI
> > passthrough setup at hand can help with that.
> >
> > In the Xen code I can still see:
> >
> >     case ECS_PIRQ: {
> >         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> >
> >         if ( !pirq )
> >             break;
> >         if ( !is_hvm_domain(d1) )
> >             pirq_guest_unbind(d1, pirq);
> >
> > which means that pirq_guest_unbind should only be called on evtchn_close
> > if the guest is not an HVM guest.
>
> I tried an experiment to call get_free_pirq on both sides of an
> evtchn_close hcall, using two SRIOV nics.  When I rmmod/modprobe, I
> see something interesting; each nic uses 3 MSIs, and it looks like
> when they shut down, each nic's 3 pirqs are not available until after
> the nic is done shutting down, so it seems like closing the evtchn
> isn't what makes the pirq free.
>
> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
>
> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
>
> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
> 0, xen_pvh_domain() == 0
>
> just to be sure, a bit of dbg so I know what domain this is :-)
>
> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
>
> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
> none of those become available for get_free_pirq.
>
> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
>
> aha, now nic 4 has fully finished shutting down, and nic 3 has started
> shutdown; we see those pirqs from nic 4 are now available.  So it must
> not be evtchn closing that frees the pirqs.
>
> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
>
> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
> (and recompiled/rebooted) and found the pirqs have already been freed
> before that is called:
>
> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
>
> I'm still following thru the kernel code, but it's not immediately
> obvious what exactly is telling the hypervisor to free the pirqs; any
> idea?
>
> From the hypervisor code, it seems that the pirq is "available" via
> is_free_pirq():
>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>
> when the evtchn is closed, it does:
>     if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>         unmap_domain_pirq_emuirq(d1, pirq->pirq);
>
> and that call to unmap_domain_pirq_emuirq does:
>     info->arch.hvm.emuirq = IRQ_UNBOUND;
>
> so, the only thing left is to clear pirq->arch.irq, but the only place
> I can find that does that is clear_domain_irq_pirq(), which is only
> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
> seeing where either of those would be called when all the kernel is
> doing is disabling a pci device.

Thanks for the info. I think I know what causes the pirq to be unmapped:
when Linux disables msi or msix on the device, using the regular pci
config space based method, QEMU (which emulates the PCI config space)
tells Xen to unmap the pirq.

If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
to be called a second time without msis being disabled first, then I
think we can revert the patch.
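To spell out the two conditions Dan traces above, is_free_pirq() can be annotated like this (a commented restatement of the excerpts quoted in this thread, assuming the field names shown there; not verbatim Xen source):

    static bool is_free_pirq(const struct domain *d, const struct pirq *pirq)
    {
        return !pirq ||
               (!pirq->arch.irq &&   /* cleared only by pirq_guest_unbind()
                                      * or unmap_domain_pirq(), i.e. not by
                                      * closing the evtchn */
                (!is_hvm_domain(d) ||
                 pirq->arch.hvm.emuirq == IRQ_UNBOUND)); /* set by
                                      * evtchn_close() via
                                      * unmap_domain_pirq_emuirq() */
    }

For an HVM guest, closing the evtchn satisfies only the emuirq half, which matches the logs above: the pirqs only show up as free once the later, QEMU-driven teardown clears arch.irq as well.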
On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Tue, 10 Jan 2017, Dan Streetman wrote:
>> [...]
>> so, the only thing left is to clear pirq->arch.irq, but the only place
>> I can find that does that is clear_domain_irq_pirq(), which is only
>> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
>> seeing where either of those would be called when all the kernel is
>> doing is disabling a pci device.
>
> Thanks for the info. I think I know what causes the pirq to be unmapped:
> when Linux disables msi or msix on the device, using the regular pci
> config space based method, QEMU (which emulates the PCI config space)
> tells Xen to unmap the pirq.

aha, via a XEN_DOMCTL_unbind_pt_irq, maybe?  Well that makes more sense then.

> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> to be called a second time without msis being disabled first, then I
> think we can revert the patch.
It doesn't seem possible to call it twice from a correctly-behaved driver, but in case of a driver bug that does try to enable msi/msix multiple times without disabling, __pci_enable_msix() only does WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that should be changed to warn and also return error, to prevent re-configuring msi/msix if already configured? Or, maybe the warning is enough - the worst thing that reverting the patch does is use extra pirqs, right?
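A sketch of the stricter guard Dan suggests, using the function and field names quoted above; the body is elided and the signature simplified, so this is illustrative only, not the actual drivers/pci/msi.c code:

    static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
    {
        /* today the kernel only does WARN_ON(!!dev->msi_enabled) and
         * continues; this variant warns and refuses the double enable */
        if (WARN_ON(dev->msi_enabled))
            return -EINVAL;
        /* ... normal minvec/maxvec negotiation continues here ... */
        return 0;
    }

The same pattern would apply to __pci_enable_msix() and dev->msix_enabled.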
On Wed, 11 Jan 2017, Dan Streetman wrote:
> [...]
> Maybe that should be changed to warn and also return error, to prevent
> re-configuring msi/msix if already configured?  Or, maybe the warning
> is enough - the worst thing that reverting the patch does is use extra
> pirqs, right?

I think the warning is enough. Can you confirm that with
af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also

    ifconfig eth0 down; ifconfig eth0 up

still work as expected, no warnings?

It looks like the patch that changed hypervisor (QEMU actually) behavior
is:

    commit c976437c7dba9c7444fb41df45468968aaa326ad
    Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
    Date:   Wed May 7 13:41:48 2014 +0000

        qemu-xen: free all the pirqs for msi/msix when driver unload

From this commit onward, QEMU started unmapping pirqs when MSIs are
disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.

If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
and older. Given that Xen 4.4 is out of support, I think we should go
ahead with it. Opinions?
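If the behavior had to stay conditional on the Xen version instead of a clean revert, the reuse path in xen_hvm_setup_msi_irqs() could be gated roughly like this. This is only a sketch: it assumes the xen_running_on_version_or_later() helper from arch/x86/xen, and the 'bind' label is hypothetical:

    if (!xen_running_on_version_or_later(4, 5)) {
        /* Xen <= 4.4: QEMU does not unmap the pirq on MSI disable, so a
         * pirq cached in the MSI message data is still valid */
        __pci_read_msi_msg(msidesc, &msg);
        pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
               ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
        if (msg.data == XEN_PIRQ_MSI_DATA && xen_irq_from_pirq(pirq) >= 0)
            goto bind;    /* reuse the still-mapped pirq */
    }
    /* Xen >= 4.5 (with QEMU c976437c7dba): always allocate a fresh pirq */
    pirq = xen_allocate_pirq_msi(dev, msidesc);

Given that Xen 4.4 is out of support, the unconditional revert discussed above avoids carrying this extra branch.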
On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:
> On Wed, 11 Jan 2017, Dan Streetman wrote:
>> [...]
>
> I think the warning is enough. Can you confirm that with
> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also
>
>     ifconfig eth0 down; ifconfig eth0 up
>
> still work as expected, no warnings?

yes, with the patch that started this thread - which essentially
reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
warnings and ifconfig down ; ifconfig up work as expected.

> [...]
> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
> and older. Given that Xen 4.4 is out of support, I think we should go
> ahead with it. Opinions?
On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote: > On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini > <sstabellini@kernel.org> wrote: >> On Wed, 11 Jan 2017, Dan Streetman wrote: >>> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini >>> <sstabellini@kernel.org> wrote: >>> > On Tue, 10 Jan 2017, Dan Streetman wrote: >>> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini >>> >> <sstabellini@kernel.org> wrote: >>> >> > On Tue, 10 Jan 2017, Dan Streetman wrote: >>> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@ieee.org> wrote: >>> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini >>> >> >> > <sstabellini@kernel.org> wrote: >>> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote: >>> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: >>> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky >>> >> >> >>> > <boris.ostrovsky@oracle.com> wrote: >>> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: >>> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: >>> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was >>> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was >>> >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous >>> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be >>> >> >> >>> > >>> allocated from the hypervisor. >>> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs >>> >> >> >>> > >> as we are not reusing them? >>> >> >> >>> > > >>> >> >> >>> > > Don't we free the pirq when we unmap it? >>> >> >> >>> > >>> >> >> >>> > I think this is actually a bit worse than I initially thought. After >>> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq >>> >> >> >>> > allocation: >>> >> >> >>> >>> >> >> >>> Lets include Stefano, >>> >> >> >>> >>> >> >> >>> Thank you for digging in this! This has quite the deja-vu >>> >> >> >>> feeling as I believe I hit this at some point in the past and >>> >> >> >>> posted some possible ways of fixing this. But sadly I can't >>> >> >> >>> find the thread. >>> >> >> >> >>> >> >> >> This issue seems to be caused by: >>> >> >> >> >>> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f >>> >> >> >> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> >>> >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000 >>> >> >> >> >>> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests >>> >> >> >> >>> >> >> >> which was a fix to a bug: >>> >> >> >> >>> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when >>> >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq >>> >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would >>> >> >> >> try to assign a new pirq anyway. >>> >> >> >> A simple way to reproduce this bug is to assign an MSI capable network >>> >> >> >> card to a PV on HVM guest, if the user brings down the corresponding >>> >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the >>> >> >> >> device. >>> >> >> >> >>> >> >> >> I don't remember any of the details. From the description of this bug, >>> >> >> >> it seems that Xen changed behavior in the past few years: before it used >>> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. 
If I wrote "the >>> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't >>> >> >> >> have been completely freed, then reassigned to somebody else the way it >>> >> >> >> is described in this email. >>> >> >> >> >>> >> >> >> I think we should indentify the changeset or Xen version that introduced >>> >> >> >> the new behavior. If it is old enough, we might be able to just revert >>> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the >>> >> >> >> behavior conditional to the Xen version. >>> >> >> > >>> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest? >>> >> >> > That's where I'm seeing this problem, with PT NVMe devices. >>> >> > >>> >> > They are the main ones. It is possible to have emulated virtio devices >>> >> > with emulated MSIs, for example virtio-net. Althought they are not in >>> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken >>> >> > too. >>> >> > >>> >> > >>> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on, >>> >> >> > with the patch from this thread (which essentially reverts your commit >>> >> >> > above) as well as some added debug to see the pirq numbers, cycles of >>> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the >>> >> >> > hypervisor provides essentially the same pirqs for each modprobe, >>> >> >> > since they were freed by the rmmod. >>> >> > >>> >> > I am fine with reverting the old patch, but we need to understand what >>> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI >>> >> > passthrough setup at hand can help with that. >>> >> > >>> >> > In the Xen code I can still see: >>> >> > >>> >> > case ECS_PIRQ: { >>> >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq); >>> >> > >>> >> > if ( !pirq ) >>> >> > break; >>> >> > if ( !is_hvm_domain(d1) ) >>> >> > pirq_guest_unbind(d1, pirq); >>> >> > >>> >> > which means that pirq_guest_unbind should only be called on evtchn_close >>> >> > if the guest is not an HVM guest. >>> >> >>> >> I tried an experiment to call get_free_pirq on both sides of a >>> >> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I >>> >> see something interesting; each nic uses 3 MSIs, and it looks like >>> >> when they shut down, each nic's 3 pirqs are not available until after >>> >> the nic is done shutting down, so it seems like closing the evtchn >>> >> isn't what makes the pirq free. >>> >> >>> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290 >>> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291 >>> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292 >>> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps >>> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293 >>> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294 >>> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295 >>> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps >>> >> >>> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98. 
>>> >> >>> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1, >>> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() == >>> >> 0, xen_pvh_domain() == 0 >>> >> >>> >> just to be sure, a bit of dbg so I know what domain this is :-) >>> >> >>> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93 >>> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295 >>> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92 >>> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91 >>> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294 >>> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90 >>> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89 >>> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293 >>> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88 >>> >> >>> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but >>> >> none of those become available for get_free_pirq. >>> >> >>> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98 >>> >> >>> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started >>> >> shutdown; we see those pirqs from nic 4 are now available. So it must >>> >> not be evtchn closing that frees the pirqs. >>> >> >>> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292 >>> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97 >>> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96 >>> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291 >>> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87 >>> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86 >>> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290 >>> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85 >>> >> >>> >> >>> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq() >>> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq() >>> >> (and recompiled/rebooted) and found the pirqs have already been freed >>> >> before that is called: >>> >> >>> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295 >>> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294 >>> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293 >>> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100 >>> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293 >>> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99 >>> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98 >>> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294 >>> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97 >>> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96 >>> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295 >>> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95 >>> >> >>> >> >>> >> I'm still following thru the kernel code, but it's not immediately >>> >> obvious what exactly is telling the hypervisor to free the pirqs; any 
>>>>>
>>>>> From the hypervisor code, it seems that the pirq is "available" via
>>>>> is_free_pirq():
>>>>>
>>>>>     return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
>>>>>         pirq->arch.hvm.emuirq == IRQ_UNBOUND));
>>>>>
>>>>> when the evtchn is closed, it does:
>>>>>
>>>>>     if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
>>>>>         unmap_domain_pirq_emuirq(d1, pirq->pirq);
>>>>>
>>>>> and that call to unmap_domain_pirq_emuirq does:
>>>>>
>>>>>     info->arch.hvm.emuirq = IRQ_UNBOUND;
>>>>>
>>>>> so, the only thing left is to clear pirq->arch.irq, but the only place
>>>>> I can find that does that is clear_domain_irq_pirq(), which is only
>>>>> called from pirq_guest_unbind() and unmap_domain_pirq(), and I'm not
>>>>> seeing where either of those would be called when all the kernel is
>>>>> doing is disabling a pci device.
>>>>
>>>> Thanks for the info. I think I know what causes the pirq to be unmapped:
>>>> when Linux disables msi or msix on the device, using the regular pci
>>>> config space based method, QEMU (which emulates the PCI config space)
>>>> tells Xen to unmap the pirq.
>>>
>>> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then.
>>>
>>>> If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
>>>> to be called a second time without msis being disabled first, then I
>>>> think we can revert the patch.
>>>
>>> It doesn't seem possible to call it twice from a correctly-behaved
>>> driver, but in case of a driver bug that does try to enable msi/msix
>>> multiple times without disabling, __pci_enable_msix() only does
>>> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
>>> WARN_ON(!!dev->msi_enabled); they both will continue.  Maybe that
>>> should be changed to warn and also return an error, to prevent
>>> re-configuring msi/msix if already configured?  Or maybe the warning
>>> is enough - the worst thing that reverting the patch does is use extra
>>> pirqs, right?
>>
>> I think the warning is enough. Can you confirm that with
>> af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted,
>>
>>     ifconfig eth0 down; ifconfig eth0 up
>>
>> also still works as expected, with no warnings?
>
> yes, with the patch that started this thread - which essentially
> reverts af42b8d12f8adec6711cb824549a0edac6a4ae8f - there are no
> warnings, and ifconfig down ; ifconfig up work as expected.
>
>> It looks like the patch that changed hypervisor (QEMU actually) behavior
>> is:
>>
>>     commit c976437c7dba9c7444fb41df45468968aaa326ad
>>     Author: Zhenzhong Duan <zhenzhong.duan@oracle.com>
>>     Date:   Wed May 7 13:41:48 2014 +0000
>>
>>         qemu-xen: free all the pirqs for msi/msix when driver unload
>>
>> From this commit onward, QEMU started unmapping pirqs when MSIs are
>> disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
>> 4.7, 4.6, and 4.5; the newest release without the commit is Xen 4.4.
>>
>> If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
>> all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
>> and older. Given that Xen 4.4 is out of support, I think we should go
>> ahead with it. Opinions?

Looks like there are no complaints; is my patch from the start of this
thread ok to use, or can you craft a patch to use?  My patch's
description could use updating to add some of the info/background from
this discussion...
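On the double-enable guard discussed in the message above: the suggestion is to turn the existing WARN_ON in the MSI/MSI-X enable paths into a hard failure. A hypothetical fragment of what that stricter check could look like (purely illustrative, not an actual kernel patch; the surrounding function is assumed):

    	/* Hypothetical stricter guard in the MSI-X enable path: refuse a
    	 * second enable instead of warning and continuing, so a buggy
    	 * driver cannot trigger a fresh pirq allocation while the old
    	 * one is still configured. */
    	if (WARN_ON(dev->msix_enabled))
    		return -EINVAL;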
On Fri, 13 Jan 2017, Dan Streetman wrote:
> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
> > On Wed, Jan 11, 2017 at 1:46 PM, Stefano Stabellini
> > <sstabellini@kernel.org> wrote:
> > > [ full quote of the thread above trimmed ]
> Looks like there are no complaints; is my patch from the start of this
> thread ok to use, or can you craft a patch to use?  My patch's
> description could use updating to add some of the info/background from
> this discussion...

Hi Dan, I would like an explicit Ack from the other maintainers, Boris
and Juergen.  Let me place them in To: to make it more obvious.
On 01/13/2017 01:44 PM, Stefano Stabellini wrote:
> On Fri, 13 Jan 2017, Dan Streetman wrote:
>> On Wed, Jan 11, 2017 at 6:25 PM, Dan Streetman <ddstreet@ieee.org> wrote:
>>> [ full quote of the thread above trimmed ]
> Hi Dan, I would like an explicit Ack from the other maintainers, Boris
> and Juergen. Let me place them in To: to make it more obvious.

Where is the patch? I don't think 'git revert' will work.

And Konrad will need to ack it too as he is Xen-PCI maintainer.

-boris
On Fri, Jan 13, 2017 at 3:00 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 01/13/2017 01:44 PM, Stefano Stabellini wrote:
>> On Fri, 13 Jan 2017, Dan Streetman wrote:
>>> [ full quote of the thread above trimmed ]
>>> Looks like there are no complaints; is my patch from the start of this
>>> thread ok to use, or can you craft a patch to use?  My patch's
>>> description could use updating to add some of the info/background from
>>> this discussion...
>>
>> Hi Dan, I would like an explicit Ack from the other maintainers, Boris
>> and Juergen. Let me place them in To: to make it more obvious.
>
> Where is the patch? I don't think 'git revert' will work.

I just re-sent the same patch that originally started this long thread,
with the description updated to include details on the previous commits
and a Fixes: line.  It reverts the main part of commit
af42b8d12f8adec6711cb824549a0edac6a4ae8f, but there are other changes in
that commit that either don't apply anymore and/or were already removed
or changed (like the changes to the xen_allocate_pirq_msi function), or
that don't need reverting (like adding the XEN_PIRQ_MSI_DATA #define,
which should stay).  The only part that really needed reverting was the
code in xen_hvm_setup_msi_irqs() that checked the pirq number cached in
the pci msi message data.

Thanks!

> And Konrad will need to ack it too as he is Xen-PCI maintainer.
>
> -boris
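To make the shape of that revert concrete, here is a hedged sketch of the resulting allocation flow, assuming the helper names mentioned in the thread (xen_allocate_pirq_msi, XEN_PIRQ_MSI_DATA) and the PHYSDEVOP_get_free_pirq path described earlier; the exact signatures and error handling in the real xen_hvm_setup_msi_irqs() may differ:

    /*
     * Sketch only: after the revert, xen_hvm_setup_msi_irqs() no longer
     * reads the MSI message data looking for a pirq tagged with
     * XEN_PIRQ_MSI_DATA (which may meanwhile belong to another device);
     * it always asks the hypervisor for a fresh pirq.
     */
    for_each_pci_msi_entry(msidesc, dev) {
    	pirq = xen_allocate_pirq_msi(dev, msidesc);	/* always allocate */
    	if (pirq < 0) {
    		irq = -ENODEV;
    		goto error;
    	}
    	irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, 0,
    				       "msi", DOMID_SELF);
    	if (irq < 0)
    		goto error;
    }

For reference, the diff below is the earlier alternative proposed in the thread, which masks but does not close the event channel for HVM domains; the thread instead settled on the revert described above.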
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 137bd0e..3174923 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
 		return;
 
 	mask_evtchn(evtchn);
-	xen_evtchn_close(evtchn);
+	if (!xen_hvm_domain())
+		xen_evtchn_close(evtchn);
 	xen_irq_info_cleanup(info);
 }