PCI: Mark Intel bridge on SuperMicro Atom C3xxx motherboards to avoid bus reset

Message ID 20190524153118.GA12862@libmpq.org (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Maik Broemme May 24, 2019, 3:31 p.m. UTC
The Intel PCI bridge on SuperMicro Atom C3xxx motherboards does not
successfully complete a bus reset when used with certain child devices.
After the reset, config accesses to the child may fail. Assigning such
a device via VFIO fails immediately with:

  vfio-pci 0000:01:00.0: Failed to return from FLR
  vfio-pci 0000:01:00.0: timed out waiting for pending transaction;
  performing function level reset anyway

The device then disappears from the PCI device list:

  !!! Unknown header type 7f
  Kernel driver in use: vfio-pci
  Kernel modules: ddbridge

The attached patch marks the root port as incapable of doing a
bus-level reset. With it applied, all my tested devices survive VFIO
assignment and several VM reboot cycles.

Signed-off-by: Maik Broemme <mbroemme@libmpq.org>
---
 drivers/pci/quirks.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Alex Williamson May 24, 2019, 4:40 p.m. UTC | #1
On Fri, 24 May 2019 17:31:18 +0200
Maik Broemme <mbroemme@libmpq.org> wrote:

> The Intel PCI bridge on SuperMicro Atom C3xxx motherboards do not
> successfully complete a bus reset when used with certain child devices.

What are these 'certain child devices'?  We can't really regression
test to know if/when the problem might be resolved if we don't know
what to test.  Do these devices reset properly in other systems?  Are
there any devices that can do a bus reset properly on this system?  We'd
really only want to blacklist bus reset on this root port(?) if this is
a systemic problem with the root port, which is not clearly proven
here.  Thanks,

Alex

> After the reset, config accesses to the child may fail. If assigning
> such device via VFIO it will immediately fail with:
> 
>   vfio-pci 0000:01:00.0: Failed to return from FLR
>   vfio-pci 0000:01:00.0: timed out waiting for pending transaction;
>   performing function level reset anyway
> 
> Device will disappear from PCI device list:
> 
>   !!! Unknown header type 7f
>   Kernel driver in use: vfio-pci
>   Kernel modules: ddbridge
> 
> The attached patch will mark the root port as incapable of doing a
> bus level reset. After that all my tested devices survive a VFIO
> assignment and several VM reboot cycles.
> 
> Signed-off-by: Maik Broemme <mbroemme@libmpq.org>
> ---
>  drivers/pci/quirks.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0f16acc323c6..86cd42872708 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3433,6 +3433,13 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
>   */
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
>  
> +/*
> + * Root port on some SuperMicro Atom C3xxx motherboards do not successfully
> + * complete a bus reset when used with certain child devices. After the
> + * reset, config accesses to the child may fail.
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19a4, quirk_no_bus_reset);
> +
>  static void quirk_no_pm_reset(struct pci_dev *dev)
>  {
>  	/*
Maik Broemme May 24, 2019, 6:41 p.m. UTC | #2
Hi Alex,

On May 24, 2019, at 18:49, Alex Williamson <alex.williamson@redhat.com> wrote:
> On Fri, 24 May 2019 17:31:18 +0200
> Maik Broemme <mbroemme@libmpq.org> wrote:
> 
> > The Intel PCI bridge on SuperMicro Atom C3xxx motherboards do not
> > successfully complete a bus reset when used with certain child devices.
> 
> What are these 'certain child devices'?  We can't really regression
> test to know if/when the problem might be resolved if we don't know
> what to test.

I tried the following devices:

- Digital Devices GmbH Octopus DVB Adapter
- Digital Devices GmbH Cine S2 V6.5
- Digital Devices GmbH Cine S2 V7
- RealTek RTL8111D
- RealTek RTL8168B
- Intel I210-T1

> Do these devices reset properly in other systems?

All these cards survive VFIO reset and VM reboot cycles in another
motherboard (SuperMicro X11SPM-F). They only fail in the SuperMicro
A2SDi-*C-HLN4F series. I have an 8-core and a 16-core version of this
motherboard.

I've tried a passive Mini PCI-E adapter (MikroTik RB11E) in the slot
with several wireless adapters (don't remember them all) and the
'Digital Devices GmbH Octopus DVB Adapter' which is Mini PCI-E. They all
produced the same error 'Failed to return from FLR' and '!!! Unknown
header type 7f'

I've also tried a PCI-E switch from PLX Technology, sold by MikroTik as the
RouterBoard RB14eU. It exports 4 Mini PCI-E ports through one PCI-E port, and
I tried it with one card and with multiple cards.

All these devices started to work once I enabled the bus reset quirk. The
RB14eU even allows assigning the individual Mini PCI-E ports to
different VMs, and the devices survive independent resets behind the PLX
bridge.

> Are there any devices that can do a bus reset properly on this system?

I'm pretty sure not. Of course, I can only test devices I own, but at
least the Intel I210-T1 supports all the functionality needed for a proper
reset.

I have owned these motherboards for ~2 years and have tried almost every
device I had during this time.

> We'd really only want to blacklist bus reset on this root port(?) if this is
> a systemic problem with the root port, which is not clearly proven
> here.  Thanks,

I can rework the patch and apply the quirk only to my tested devices but
I strongly believe that it is an issue on the root port, independent
from the device.

> 
> Alex
> 
> > After the reset, config accesses to the child may fail. If assigning
> > such device via VFIO it will immediately fail with:
> > 
> >   vfio-pci 0000:01:00.0: Failed to return from FLR
> >   vfio-pci 0000:01:00.0: timed out waiting for pending transaction;
> >   performing function level reset anyway
> > 
> > Device will disappear from PCI device list:
> > 
> >   !!! Unknown header type 7f
> >   Kernel driver in use: vfio-pci
> >   Kernel modules: ddbridge
> > 
> > The attached patch will mark the root port as incapable of doing a
> > bus level reset. After that all my tested devices survive a VFIO
> > assignment and several VM reboot cycles.
> > 
> > Signed-off-by: Maik Broemme <mbroemme@libmpq.org>
> > ---
> >  drivers/pci/quirks.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 0f16acc323c6..86cd42872708 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3433,6 +3433,13 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> >   */
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
> >  
> > +/*
> > + * Root port on some SuperMicro Atom C3xxx motherboards do not successfully
> > + * complete a bus reset when used with certain child devices. After the
> > + * reset, config accesses to the child may fail.
> > + */
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19a4, quirk_no_bus_reset);
> > +
> >  static void quirk_no_pm_reset(struct pci_dev *dev)
> >  {
> >  	/*
> 

--Maik
Bjorn Helgaas May 29, 2019, 10:03 p.m. UTC | #3
[+cc Alex]

On Fri, May 24, 2019 at 05:31:18PM +0200, Maik Broemme wrote:
> The Intel PCI bridge on SuperMicro Atom C3xxx motherboards do not
> successfully complete a bus reset when used with certain child devices.
> After the reset, config accesses to the child may fail. If assigning
> such device via VFIO it will immediately fail with:
> 
>   vfio-pci 0000:01:00.0: Failed to return from FLR
>   vfio-pci 0000:01:00.0: timed out waiting for pending transaction;
>   performing function level reset anyway

I guess these messages are from v4.13 or earlier, since the "Failed to
return from FLR" text was removed by 821cdad5c46c ("PCI: Wait up to 60
seconds for device to become ready after FLR"), which appeared in
v4.14.

I suppose a current kernel would fail similarly, but could you try it?
I think a current kernel would give more informative messages like:

  not ready XXms after FLR, giving up
  not ready XXms after bus reset, giving up

I don't understand the connection here: the messages you quote are
related to FLR, but the quirk isn't related to FLR.  The quirk
prevents a secondary bus reset.  So is it the case that we try FLR
first, it fails, then we try a secondary bus reset (does this succeed?
you don't mention an error from it), and the device remains
unresponsive and VFIO assignment fails?

And with the quirk, I assume we still try FLR, and it still fails.
But we *don't* try a secondary bus reset, and the device magically
works?  That's confusing to me.

> Device will disappear from PCI device list:
> 
>   !!! Unknown header type 7f
>   Kernel driver in use: vfio-pci
>   Kernel modules: ddbridge
> 
> The attached patch will mark the root port as incapable of doing a
> bus level reset. After that all my tested devices survive a VFIO
> assignment and several VM reboot cycles.
> 
> Signed-off-by: Maik Broemme <mbroemme@libmpq.org>
> ---
>  drivers/pci/quirks.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0f16acc323c6..86cd42872708 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3433,6 +3433,13 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
>   */
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
>  
> +/*
> + * Root port on some SuperMicro Atom C3xxx motherboards do not successfully
> + * complete a bus reset when used with certain child devices. After the
> + * reset, config accesses to the child may fail.
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19a4, quirk_no_bus_reset);
> +
>  static void quirk_no_pm_reset(struct pci_dev *dev)
>  {
>  	/*
> -- 
> 2.21.0
Alex Williamson May 30, 2019, 1:49 a.m. UTC | #4
On Wed, 29 May 2019 17:03:07 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Alex]
> 
> On Fri, May 24, 2019 at 05:31:18PM +0200, Maik Broemme wrote:
> > The Intel PCI bridge on SuperMicro Atom C3xxx motherboards do not
> > successfully complete a bus reset when used with certain child devices.
> > After the reset, config accesses to the child may fail. If assigning
> > such device via VFIO it will immediately fail with:
> > 
> >   vfio-pci 0000:01:00.0: Failed to return from FLR
> >   vfio-pci 0000:01:00.0: timed out waiting for pending transaction;
> >   performing function level reset anyway  
> 
> I guess these messages are from v4.13 or earlier, since the "Failed to
> return from FLR" text was removed by 821cdad5c46c ("PCI: Wait up to 60
> seconds for device to become ready after FLR"), which appeared in
> v4.14.
> 
> I suppose a current kernel would fail similarly, but could you try it?
> I think a current kernel would give more informative messages like:
> 
>   not ready XXms after FLR, giving up
>   not ready XXms after bus reset, giving up
> 
> I don't understand the connection here: the messages you quote are
> related to FLR, but the quirk isn't related to FLR.  The quirk
> prevents a secondary bus reset.  So is it the case that we try FLR
> first, it fails, then we try a secondary bus reset (does this succeed?
> you don't mention an error from it), and the device remains
> unresponsive and VFIO assignment fails?
> 
> And with the quirk, I assume we still try FLR, and it still fails.
> But we *don't* try a secondary bus reset, and the device magically
> works?  That's confusing to me.

As a counterpoint, I found a system with this root port in our test
environment.  It's not ideal, as this root port has a PCIe-to-PCI bridge
downstream of it with a Matrox graphics device downstream of that.  I can't
use vfio-pci to reset this hierarchy, but I can use setpci, e.g.:

# lspci -nnvs 00:09.0
00:09.0 PCI bridge [0604]: Intel Corporation Device [8086:19a4] (rev 11) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 26
	Memory at 1080000000 (64-bit, non-prefetchable) [size=128K]
	Bus: primary=00, secondary=03, subordinate=04, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: 84000000-848fffff [size=9M]
	Prefetchable memory behind bridge: 0000000082000000-0000000083ffffff [size=32M]
# lspci -nnvs 03:00.0
03:00.0 PCI bridge [0604]: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge [104c:8231] (rev 03) (prog-if 00 [Normal decode])
	Flags: fast devsel
	Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
	I/O behind bridge: 00000000-00000fff [size=4K]
	Memory behind bridge: 00000000-000fffff [size=1M]
	Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]

(resources are reset from previous experiments)

# setpci -s 00:09.0 3e.w=40:40
# lspci -nnvs 03:00.0
03:00.0 PCI bridge [0604]: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge [104c:8231] (rev ff) (prog-if ff)
	!!! Unknown header type 7f

(bus in reset, config space unavailable, EXPECTED)

# setpci -s 00:09.0 3e.w=0:40
[root@intel-harrisonville-01 devices]# lspci -nnvs 03:00.0
03:00.0 PCI bridge [0604]: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge [104c:8231] (rev 03) (prog-if 00 [Normal decode])
	Flags: fast devsel
	Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
	I/O behind bridge: 00000000-00000fff [size=4K]
	Memory behind bridge: 00000000-000fffff [size=1M]
	Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]

(bus out of reset, downstream config space is available again)
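
For readers unfamiliar with the setpci syntax used above, here is a minimal
sketch of what "3e.w=40:40" and "3e.w=0:40" do to the 16-bit Bridge Control
register. The register offset and bit value are the ones named
PCI_BRIDGE_CONTROL and PCI_BRIDGE_CTL_BUS_RESET in the Linux pci_regs.h
header. The helper names masked_write(), assert_bus_reset() and
deassert_bus_reset() are hypothetical and exist only for this illustration.
It models the value:mask arithmetic in userspace and touches no hardware.

```c
#include <stdint.h>

/* Names and values as defined in the Linux header pci_regs.h. */
#define PCI_BRIDGE_CONTROL        0x3e  /* 16-bit Bridge Control register */
#define PCI_BRIDGE_CTL_BUS_RESET  0x40  /* bit 6: Secondary Bus Reset */

/*
 * Hypothetical helper modelling a setpci "value:mask" write: bits set in
 * mask are taken from value, all other bits keep their current contents.
 */
uint16_t masked_write(uint16_t current, uint16_t value, uint16_t mask)
{
	return (uint16_t)((current & ~mask) | (value & mask));
}

/* Equivalent of "setpci -s <bridge> 3e.w=40:40": put the bus in reset. */
uint16_t assert_bus_reset(uint16_t bridge_ctl)
{
	return masked_write(bridge_ctl, PCI_BRIDGE_CTL_BUS_RESET,
			    PCI_BRIDGE_CTL_BUS_RESET);
}

/* Equivalent of "setpci -s <bridge> 3e.w=0:40": take the bus out of reset. */
uint16_t deassert_bus_reset(uint16_t bridge_ctl)
{
	return masked_write(bridge_ctl, 0, PCI_BRIDGE_CTL_BUS_RESET);
}
```

Because only the masked bit changes, the other Bridge Control settings are
the same before and after the two writes.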

I'm also confused about the description of this device:

On Fri, 24 May 2019 20:41:13 +0200 Maik Broemme <mbroemme@libmpq.org> wrote:
> Also I've tried a PCI-E switch from PLX technology, sold by MikroTik, the
> RouterBoard RB14eU. It is exports 4 Mini PCI ports in one PCI-E port and
> I tried it with one card and multiple cards.
> 
> All these devices start to work once I enabled the bus reset quirk. The
> RB14eU even allows to assign the individual Mini PCI-E ports to
> different VMs and survive independent resets behind the PLX bridge.

To me this describes a topology like:

[RP]---[US]-+-[DS]--[EP]
            +-[DS]--[EP]
            +-[DS]--[EP]
            \-[DS]--[EP]

(RootPort/UpstreamSwitch/DownstreamSwitch/EndPoint)

We can only assign endpoints to VMs through vfio, therefore if we
need to reset the EP via a bus reset, that reset would occur at the
downstream switch point, not the root port.  It doesn't make sense that
a quirk at the RP would resolve anything about this use case.
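
The argument can be sketched with a small model. This is a deliberately
simplified illustration of the reasoning above, not the kernel's reset code:
struct node, reset_bridge() and bus_reset_allowed() are hypothetical names,
and the model assumes that only the bridge directly above a device is
consulted when deciding whether a secondary bus reset may be issued for it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a device in the PCIe hierarchy. */
struct node {
	const char *name;
	struct node *upstream;  /* bridge directly above, NULL at the root */
	bool no_bus_reset;      /* models a "no bus reset" quirk flag */
};

/* The bridge whose secondary bus reset would be used to reset dev. */
const struct node *reset_bridge(const struct node *dev)
{
	return dev->upstream;
}

/*
 * In this model a bus reset of dev is allowed only if the bridge directly
 * above it exists and is not flagged. Bridges further up are not consulted.
 */
bool bus_reset_allowed(const struct node *dev)
{
	const struct node *bridge = reset_bridge(dev);

	return bridge != NULL && !bridge->no_bus_reset;
}
```

With a chain RP -> US -> DS -> EP and the flag set only on RP, this model
still allows the reset of EP, because the reset is issued by DS. The flag
only blocks the reset of US, the device directly below RP.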

Also, per the Intel datasheet, this is not the only root port in this
processor and presumably they'd all work the same way, so handling one
ID as a special case seems wrong regardless.  Thanks,

Alex

> > Device will disappear from PCI device list:
> > 
> >   !!! Unknown header type 7f
> >   Kernel driver in use: vfio-pci
> >   Kernel modules: ddbridge
> > 
> > The attached patch will mark the root port as incapable of doing a
> > bus level reset. After that all my tested devices survive a VFIO
> > assignment and several VM reboot cycles.
> > 
> > Signed-off-by: Maik Broemme <mbroemme@libmpq.org>
> > ---
> >  drivers/pci/quirks.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 0f16acc323c6..86cd42872708 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3433,6 +3433,13 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> >   */
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
> >  
> > +/*
> > + * Root port on some SuperMicro Atom C3xxx motherboards do not successfully
> > + * complete a bus reset when used with certain child devices. After the
> > + * reset, config accesses to the child may fail.
> > + */
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19a4, quirk_no_bus_reset);
> > +
> >  static void quirk_no_pm_reset(struct pci_dev *dev)
> >  {
> >  	/*
> > -- 
> > 2.21.0

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0f16acc323c6..86cd42872708 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3433,6 +3433,13 @@  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
  */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_no_bus_reset);
 
+/*
+ * Root port on some SuperMicro Atom C3xxx motherboards do not successfully
+ * complete a bus reset when used with certain child devices. After the
+ * reset, config accesses to the child may fail.
+ */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19a4, quirk_no_bus_reset);
+
 static void quirk_no_pm_reset(struct pci_dev *dev)
 {
 	/*