[v5,2/2] PCI: quirks: Fix ThunderX2 dma alias handling

Message ID 1492115445-4069-3-git-send-email-jnair@caviumnetworks.com (mailing list archive)
State New, archived

Commit Message

Jayachandran C April 13, 2017, 8:30 p.m. UTC
On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
PCI topology is slightly unusual.  For a multi-node system, it looks
like:

    00:00.0 [PCI] bridge to [bus 01-1e]
    01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
    02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
    03:00.0 PCIe Endpoint

pci_for_each_dma_alias() assumes IOMMU translation is done at the root
of the PCI hierarchy.  It generates 03:00.0, 01:0a.0, and 00:00.0 as
DMA aliases for 03:00.0 because buses 01 and 00 are non-PCIe buses that
don't carry the Requester ID.

Because the ThunderX2 IOMMU is at 02:00.0, the Requester IDs 01:0a.0
or 00:00.0 are never valid for the endpoint.  This quirk stops alias
generation at the XLATE_ROOT bridge so we won't generate 01:0a.0 or
00:00.0.

The current IOMMU code only maps the last alias (this is a separate bug
in itself).  Prior to this quirk, we only created IOMMU mappings for the
invalid Requester ID 00:00.0, which never matched any DMA transactions.

With this quirk, we create IOMMU mappings for a valid Requester ID,
which fixes devices with no aliases but leaves devices with aliases
still broken.

The last alias for the endpoint is also used by the ARM GICv3 MSI-X
code.  Without this quirk, the GIC Interrupt Translation Tables are
set up with the invalid Requester ID, and the MSI-X generated by the
device fails to be translated and routed.

Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: David Daney <david.daney@cavium.com>
---
 drivers/pci/quirks.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
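
The diff body is not reproduced in this archive.  As a rough illustration
of the behaviour the commit message describes, here is a small userspace
model in C.  It is not the kernel code: struct model_dev, the model_*
functions and the xlate_root field are hypothetical names invented for
this sketch, and the alias rule is simplified to "every bridge above the
endpoint that is not a PCIe root port contributes an alias".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device kinds; only what the example topology needs. */
enum model_kind {
	KIND_ENDPOINT,
	KIND_PCIE_ROOT_PORT,
	KIND_PCI_TO_PCIE_BRIDGE,
	KIND_PCI_BRIDGE
};

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct model_dev {
	uint8_t bus, devno, fn;
	enum model_kind kind;
	int xlate_root;                   /* translation happens at this bridge */
	const struct model_dev *upstream; /* bridge above, NULL at the top */
};

/* Requester ID layout: bus[15:8], device[7:3], function[2:0]. */
static uint16_t model_rid(const struct model_dev *d)
{
	return (uint16_t)((d->bus << 8) | (d->devno << 3) | d->fn);
}

/*
 * Collect the aliases for dev into out[] and return how many were stored.
 * With honor_quirk set, the walk stops at the first bridge marked
 * xlate_root, so nothing above it is reported.
 */
size_t model_dma_aliases(const struct model_dev *dev, int honor_quirk,
			 uint16_t *out, size_t max)
{
	const struct model_dev *br;
	size_t n = 0;

	assert(dev != NULL && out != NULL);
	if (n < max)
		out[n++] = model_rid(dev);

	for (br = dev->upstream; br != NULL; br = br->upstream) {
		if (honor_quirk && br->xlate_root)
			break;
		/* A PCIe root port forwards the Requester ID unchanged. */
		if (br->kind == KIND_PCIE_ROOT_PORT)
			continue;
		if (n < max)
			out[n++] = model_rid(br);
	}
	return n;
}

/* The four-device chain from the commit message. */
size_t model_thunderx2_aliases(int honor_quirk, uint16_t *out, size_t max)
{
	const struct model_dev host = { 0x00, 0x00, 0, KIND_PCI_BRIDGE, 0, NULL };
	const struct model_dev glue = { 0x01, 0x0a, 0, KIND_PCI_TO_PCIE_BRIDGE, 0, &host };
	const struct model_dev port = { 0x02, 0x00, 0, KIND_PCIE_ROOT_PORT, 1, &glue };
	const struct model_dev ep   = { 0x03, 0x00, 0, KIND_ENDPOINT, 0, &port };

	return model_dma_aliases(&ep, honor_quirk, out, max);
}
```

Without the quirk the model yields three aliases and the last one is
00:00.0, which is the invalid Requester ID the commit message says ends
up in the IOMMU and ITS mappings.  With the quirk only 03:00.0 remains,
so "the last alias" is a valid ID.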

Comments

Bjorn Helgaas April 14, 2017, 12:19 a.m. UTC | #1
I tentatively applied both patches to pci/host-thunder for v4.12.

However, I am concerned about the topology here:

On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
> PCI topology is slightly unusual.  For a multi-node system, it looks
> like:
> 
>     00:00.0 [PCI] bridge to [bus 01-1e]
>     01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
>     02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
>     03:00.0 PCIe Endpoint

A root port normally has a single PCIe link leading downstream.
According to this, 02:00.0 is a root port that has the usual
downstream link leading to 03:00.0, but it also has an upstream link
to 01:0a.0.

Maybe this example is omitting details that are not relevant to DMA
aliases?  The PCIe capability only contains one set of link-related
registers, so I don't know how we could manage a single device that
has two links.

A device with two links would break things like ASPM.  In
set_pcie_port_type(), for example, we have this comment:

   * A Root Port or a PCI-to-PCIe bridge is always the upstream end
   * of a Link.  No PCIe component has two Links.  Two Links are
   * connected by a Switch that has a Port on each Link and internal
   * logic to connect the two Ports.

The topology above breaks these assumptions, which will make
pdev->has_secondary_link incorrect, which means ASPM won't work
correctly.

What am I missing?

Bjorn

Jayachandran C April 14, 2017, 9:06 p.m. UTC | #2
On Thu, Apr 13, 2017 at 07:19:11PM -0500, Bjorn Helgaas wrote:
> I tentatively applied both patches to pci/host-thunder for v4.12.
> 
> However, I am concerned about the topology here:
> 
> On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
> > On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
> > PCI topology is slightly unusual.  For a multi-node system, it looks
> > like:
> > 
> >     00:00.0 [PCI] bridge to [bus 01-1e]
> >     01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
> >     02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
> >     03:00.0 PCIe Endpoint
> 
> A root port normally has a single PCIe link leading downstream.
> According to this, 02:00.0 is a root port that has the usual
> downstream link leading to 03:00.0, but it also has an upstream link
> to 01:0a.0.

The PCI topology is a bit broken due to the way that the PCIe IP block
was integrated into SoC PCI bridges and devices. The current mechanism
of adding a PCI-PCIe bridge to glue these together is not ideal.

> Maybe this example is omitting details that are not relevant to DMA
> aliases?  The PCIe capability only contains one set of link-related
> registers, so I don't know how we could manage a single device that
> has two links.
 
The root port is standard and has just one link to the EP (or whatever
is on the external PCIe slot). The fallout of the current hw design is
that the PCI-PCIe bridge has a link that does not follow standard and
does not have a counterpart (as you noted).

> A device with two links would break things like ASPM.  In
> set_pcie_port_type(), for example, we have this comment:
> 
>    * A Root Port or a PCI-to-PCIe bridge is always the upstream end
>    * of a Link.  No PCIe component has two Links.  Two Links are
>    * connected by a Switch that has a Port on each Link and internal
>    * logic to connect the two Ports.
> 
> The topology above breaks these assumptions, which will make
> pdev->has_secondary_link incorrect, which means ASPM won't work
> correctly.

Given the current hardware, the pcieport driver seems to work reasonably
for the root port at 02:00.0, with AER support. I will take a look at the
ASPM part.

Thanks,
JC.

Bjorn Helgaas April 15, 2017, 2 a.m. UTC | #3
On Fri, Apr 14, 2017 at 4:06 PM, Jayachandran C
<jnair@caviumnetworks.com> wrote:
> On Thu, Apr 13, 2017 at 07:19:11PM -0500, Bjorn Helgaas wrote:
>> I tentatively applied both patches to pci/host-thunder for v4.12.
>>
>> However, I am concerned about the topology here:
>>
>> On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
>> > On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
>> > PCI topology is slightly unusual.  For a multi-node system, it looks
>> > like:
>> >
>> >     00:00.0 [PCI] bridge to [bus 01-1e]
>> >     01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
>> >     02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
>> >     03:00.0 PCIe Endpoint
>>
>> A root port normally has a single PCIe link leading downstream.
>> According to this, 02:00.0 is a root port that has the usual
>> downstream link leading to 03:00.0, but it also has an upstream link
>> to 01:0a.0.
>
> The PCI topology is a bit broken due to the way that the PCIe IP block
> was integrated into SoC PCI bridges and devices. The current mechanism
> of adding a PCI-PCIe bridge to glue these together is not ideal.

Yeah, that's definitely broken.

>> Maybe this example is omitting details that are not relevant to DMA
>> aliases?  The PCIe capability only contains one set of link-related
>> registers, so I don't know how we could manage a single device that
>> has two links.
>
> The root port is standard and has just one link to the EP (or whatever
> is on the external PCIe slot). The fallout of the current hw design is
> that the PCI-PCIe bridge has a link that does not follow standard and
> does not have a counterpart (as you noted).
>
>> A device with two links would break things like ASPM.  In
>> set_pcie_port_type(), for example, we have this comment:
>>
>>    * A Root Port or a PCI-to-PCIe bridge is always the upstream end
>>    * of a Link.  No PCIe component has two Links.  Two Links are
>>    * connected by a Switch that has a Port on each Link and internal
>>    * logic to connect the two Ports.
>>
>> The topology above breaks these assumptions, which will make
>> pdev->has_secondary_link incorrect, which means ASPM won't work
>> correctly.
>
> Given the current hardware, the pcieport driver seems to work reasonably
> for the root port at 02:00.0, with AER support. I will take a look at the
> ASPM part.

I don't think pcieport itself cares much about links.  ASPM does, but
it looks like set_pcie_port_type() actually is smart enough to know
that PCI-to-PCIe bridges and Root Ports always have links on their
secondary sides.  So has_secondary_link probably does get set
correctly.

But I think you'll hit the VIA "strange chipset" thing in
pcie_aspm_init_link_state(), which will probably prevent us from doing
ASPM on the link from 02:00.0 to 03:00.0.

Could you collect "lspci -vv" output from this system?  I'd like to
archive that as background for this IOMMU issue and the ASPM tweaks I
suspect we'll have to do.  I *wish* we had more information about that
VIA thing, because I suspect we could get rid of it if we had more
details.
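
The secondary-link rule mentioned above can be written down as a tiny
userspace sketch.  This is only a model of the rule as stated in this
mail (Root Ports and PCI-to-PCIe bridges always have a link on their
secondary side); the function names are made up for the example, switch
ports are deliberately left out, and the type numbers are the Device/Port
Type values quoted in the commit message (4 and 8) plus 0 for an
endpoint.

```c
#include <assert.h>
#include <stddef.h>

/* Device/Port Type values as used in the topology listing. */
enum {
	PORT_TYPE_ENDPOINT           = 0x0,
	PORT_TYPE_ROOT_PORT          = 0x4,
	PORT_TYPE_PCI_TO_PCIE_BRIDGE = 0x8
};

/*
 * Returns 1 for the port types that, per the rule above, always have a
 * Link on their secondary side.  Switch ports need the parent's state
 * and are not modelled here.
 */
int model_always_has_secondary_link(int port_type)
{
	return port_type == PORT_TYPE_ROOT_PORT ||
	       port_type == PORT_TYPE_PCI_TO_PCIE_BRIDGE;
}

/*
 * Count how many bridges in the 01:0a.0 -> 02:00.0 -> 03:00.0 chain
 * claim a secondary-side Link.
 */
int model_thunderx2_link_claims(void)
{
	static const int chain[] = {
		PORT_TYPE_PCI_TO_PCIE_BRIDGE, /* 01:0a.0 */
		PORT_TYPE_ROOT_PORT,          /* 02:00.0 */
		PORT_TYPE_ENDPOINT            /* 03:00.0 */
	};
	int claims = 0;
	size_t i;

	assert(sizeof(chain) / sizeof(chain[0]) == 3);
	for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
		claims += model_always_has_secondary_link(chain[i]);
	return claims;
}
```

Both 01:0a.0 and 02:00.0 come out as "upstream end of a Link", so the
Link below 01:0a.0 has a Root Port rather than an upstream-facing
component at its other end.  That is the oddity in this topology.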

Bjorn

Jayachandran C April 17, 2017, 5:47 p.m. UTC | #4
On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
> Could you collect "lspci -vv" output from this system?  I'd like to
> archive that as background for this IOMMU issue and the ASPM tweaks I
> suspect we'll have to do.  I *wish* we had more information about that
> VIA thing, because I suspect we could get rid of it if we had more
> details.

The full logs are slightly large, so I have kept them at:
https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt

The output is from 2 socket system, the cards are not on the first slot
like the example above, so the bus and device numbers are different.

Looks like I have to spend some time on ASPM next.

Thanks,
JC.

Bjorn Helgaas April 17, 2017, 7:51 p.m. UTC | #5
On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
<jnair@caviumnetworks.com> wrote:
> The full logs are slightly large, so I have kept them at:
> https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
>
> The output is from 2 socket system, the cards are not on the first slot
> like the example above, so the bus and device numbers are different.
>
> Looks like I have to spend some time on ASPM next.

Thanks, I attached these to
https://bugzilla.kernel.org/show_bug.cgi?id=195447 and added that link
to the changelogs.

  01:0a.0 PCI-to-PCIe bridge to [bus 02-03]
    Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge

lspci doesn't decode the "Slot Implemented" bit here.  The spec (PCIe
r3.1, sec 7.8.2) isn't explicit about whether that bit is defined for
this kind of bridge, but it seems to me like this bridge contains a
Downstream Port that could lead to a slot, so we *should* decode "Slot
Implemented", and if it does indicate a slot, we should decode the
Slot Capabilities, Control, and Status registers as well.

Linux also doesn't currently believe this bridge can have a slot below
it (see pcie_cap_has_sltctl() and pcie_downstream_port()).  I don't
know if your topology has actual slots there, but I think the spec
does allow it, so Linux probably should handle that.

For this port:

  02:00.0 Root Port to [bus 03]
    Capabilities: [ac] Express (v2) Root Port (Slot-)

I'm pretty sure there *is* a slot (currently empty), and your lspci
output shows "Slot-", which seems wrong to me.  It should show "Slot+"
with Presence Detect State showing "Slot Empty", shouldn't it?

Bjorn

Jon Masters April 19, 2017, 11:38 p.m. UTC | #6
Hi Bjorn, JC,

On 04/13/2017 08:19 PM, Bjorn Helgaas wrote:

> I tentatively applied both patches to pci/host-thunder for v4.12.

Thanks for that :)

> However, I am concerned about the topology here:

Various feedback has been provided on this one over the past $time. In
addition, I have requested that this serve as an example of why we need
a more complete PCIe integration guide for ARMv8. It's on the list of
things for my intended opus magnum on the topic ;)

> On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
>> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
>> PCI topology is slightly unusual.  For a multi-node system, it looks
>> like:
>>
>>     00:00.0 [PCI] bridge to [bus 01-1e]
>>     01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
>>     02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
>>     03:00.0 PCIe Endpoint
> 
> A root port normally has a single PCIe link leading downstream.

In integration terms, there's always a bus on the other side of the RC.
It's just usually a processor local bus of some kind on a proprietary
interconnect, but there's always something there. The issue is when you
can see it as PCIe and it's not through a transparent glue bridge.

I had originally hoped that the ECAM could be hacked up so that we would
first walk the topology at the 02:00.0 as a root and not see what's
above it BUT there are other device attachments up there for the on-SoC
devices and I think we really intend to use those.

Bottom line in my opinion is document this case, use it as a learning
example, and move forward. This has been useful in justifying why we
need better integration documentation from the server community. And in
particular from the OS vendors, of which I guess we can allude to there
being more than Linux interested in ARM server these days ;)

Jon.

Jon Masters April 20, 2017, 12:25 a.m. UTC | #7
One additional footnote. I spent a bunch of time recently on my personal
Xeon systems walking through the PCIe topology and aligning on how to
advise the ARM server community proceed going forward. If you look at
how Intel vs AMD handle their host bridges for example, you'll see two
very different approaches to the case of cross-socket PCIe. But my
operating assumption is that anything longer term which looks boring and
x86 enough is probably fine from an ARM server point of view.


Bjorn Helgaas April 20, 2017, 1:20 p.m. UTC | #8
On Wed, Apr 19, 2017 at 7:25 PM, Jon Masters <jcm@redhat.com> wrote:
> One additional footnote. I spent a bunch of time recently on my personal
> Xeon systems walking through the PCIe topology and aligning on how to
> advise the ARM server community proceed going forward. If you look at
> how Intel vs AMD handle their host bridges for example, you'll see two
> very different approaches to the case of cross-socket PCIe.

As a learning opportunity for me, can you share "lspci -vv" examples
that show this Intel vs AMD difference?  Maybe the ACPI host bridge
descriptions from dmesg are relevant too?

> But my
> operating assumption is that anything longer term which looks boring and
> x86 enough is probably fine from an ARM server point of view.

That sounds pretty safe to me.

Bjorn

Bjorn Helgaas April 21, 2017, 3:48 p.m. UTC | #9
On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
<jnair@caviumnetworks.com> wrote:
> The full logs are slightly large, so I have kept them at:
> https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
>
> The output is from 2 socket system, the cards are not on the first slot
> like the example above, so the bus and device numbers are different.

Can somebody with this system collect the "lspci -xxxx" output as well?

I'm making some lspci changes to handle the PCI-to-PCIe bridge
correctly, and I can use the "lspci -xxxx" output to create an lspci
test case.

Jayachandran C April 21, 2017, 5:05 p.m. UTC | #10
On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> <jnair@caviumnetworks.com> wrote:
> > The full logs are slightly large, so I have kept them at:
> > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
> >
> > The output is from 2 socket system, the cards are not on the first slot
> > like the example above, so the bus and device numbers are different.
> 
> Can somebody with this system collect the "lspci -xxxx" output as well?
> 
> I'm making some lspci changes to handle the PCI-to-PCIe bridge
> correctly, and I can use the "lspci -xxxx" output to create an lspci
> test case.

[Sorry was AFK for a few days]

I have updated the above directory with the log. Also tested your next branch
and it works fine on ThunderX2.

JC.

Bjorn Helgaas April 21, 2017, 5:57 p.m. UTC | #11
On Fri, Apr 21, 2017 at 05:05:41PM +0000, Jayachandran C wrote:
> On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> > <jnair@caviumnetworks.com> wrote:
> > > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:

> > >> Could you collect "lspci -vv" output from this system?  I'd like to
> > >> archive that as background for this IOMMU issue and the ASPM tweaks I
> > >> suspect we'll have to do.  I *wish* we had more information about that
> > >> VIA thing, because I suspect we could get rid of it if we had more
> > >> details.
> > >
> > > The full logs are slightly large, so I have kept them at:
> > > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
> > >
> > > The output is from a 2-socket system; the cards are not in the first slot
> > > like the example above, so the bus and device numbers are different.
> > 
> > Can somebody with this system collect the "lspci -xxxx" output as well?
> > 
> > I'm making some lspci changes to handle the PCI-to-PCIe bridge
> > correctly, and I can use the "lspci -xxxx" output to create an lspci
> > test case.
> 
> [Sorry was AFK for a few days]
> 
> I have updated the above directory with the log. Also tested your next branch
> and it works fine on ThunderX2.

Thanks!

With regard to my lspci changes, they add "Slot-" here:

   01:0a.0 PCI bridge: Broadcom Corporation Device 9039
   ...
  -   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge, MSI 00
  +   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge (Slot-), MSI 00

for all your PCI-to-PCIe bridges.  I assume the "Slot-" is correct, i.e.,
the link is not connected to a slot, right?  This comes from the "Slot
Implemented" bit in the PCIe Capabilities Register.

I did notice that all the Root Port devices claim to *not* be connected to
slots, which doesn't seem right.  For example,

  12:00.0 PCI bridge: Broadcom Corporation Device 9084
      Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
      Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00

  13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection

It seems strange because the 12:00.0 Root Port looks like it probably
*does* lead to a slot where the NIC is plugged in.  Or is that NIC really
soldered down?

But I assume there are *some* PCIe slots, so at least some of those Root Ports
should advertise "Slot+" (which by itself does not imply hotplug support,
if that's the concern).

Bjorn
Jayachandran C April 25, 2017, 1:03 p.m. UTC | #12
On Fri, Apr 21, 2017 at 12:57:05PM -0500, Bjorn Helgaas wrote:
> On Fri, Apr 21, 2017 at 05:05:41PM +0000, Jayachandran C wrote:
> > On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> > > <jnair@caviumnetworks.com> wrote:
> > > > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
> 
> > > >> Could you collect "lspci -vv" output from this system?  I'd like to
> > > >> archive that as background for this IOMMU issue and the ASPM tweaks I
> > > >> suspect we'll have to do.  I *wish* we had more information about that
> > > >> VIA thing, because I suspect we could get rid of it if we had more
> > > >> details.
> > > >
> > > > The full logs are slightly large, so I have kept them at:
> > > > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > > > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
> > > >
> > > > The output is from a 2-socket system; the cards are not in the first slot
> > > > like the example above, so the bus and device numbers are different.
> > > 
> > > Can somebody with this system collect the "lspci -xxxx" output as well?
> > > 
> > > I'm making some lspci changes to handle the PCI-to-PCIe bridge
> > > correctly, and I can use the "lspci -xxxx" output to create an lspci
> > > test case.
> > 
> > [Sorry was AFK for a few days]
> > 
> > I have updated the above directory with the log. Also tested your next branch
> > and it works fine on ThunderX2.
> 
> Thanks!
> 
> With regard to my lspci changes, they add "Slot-" here:
> 
>    01:0a.0 PCI bridge: Broadcom Corporation Device 9039
>    ...
>   -   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge, MSI 00
>   +   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge (Slot-), MSI 00
> 
> for all your PCI-to-PCIe bridges.  I assume the "Slot-" is correct, i.e.,
> the link is not connected to a slot, right?  This comes from the "Slot
> Implemented" bit in the PCIe Capabilities Register.

Yes, Slot- should be correct.
 
> I did notice that all the Root Port devices claim to *not* be connected to
> slots, which doesn't seem right.  For example,
> 
>   12:00.0 PCI bridge: Broadcom Corporation Device 9084
>       Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
>       Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
> 
>   13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
> 
> It seems strange because the 12:00.0 Root Port looks like it probably
> *does* lead to a slot where the NIC is plugged in.  Or is that NIC really
> soldered down?
> 
> But I assume there are *some* PCIe slots, so at least some of those Root Ports
> should advertise "Slot+" (which by itself does not imply hotplug support,
> if that's the concern).

The Root Ports are connected to slots, so I am not sure why the "Slot
Implemented" bit is not set. There seems to be nothing useful in the slot
capabilities, so this may be ok for now. I have reported this to the hardware team.

Thanks for the review and the comments,
JC.
Bjorn Helgaas April 25, 2017, 1:37 p.m. UTC | #13
On Tue, Apr 25, 2017 at 8:03 AM, Jayachandran C
<jnair@caviumnetworks.com> wrote:
> On Fri, Apr 21, 2017 at 12:57:05PM -0500, Bjorn Helgaas wrote:

>> I did notice that all the Root Port devices claim to *not* be connected to
>> slots, which doesn't seem right.  For example,
>>
>>   12:00.0 PCI bridge: Broadcom Corporation Device 9084
>>       Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
>>       Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
>>
>>   13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
>>
>> It seems strange because the 12:00.0 Root Port looks like it probably
>> *does* lead to a slot where the NIC is plugged in.  Or is that NIC really
>> soldered down?
>>
>> But I assume there are *some* PCIe slots, so at least some of those Root Ports
>> should advertise "Slot+" (which by itself does not imply hotplug support,
>> if that's the concern).
>
> The Root Ports are connected to slots, so I am not sure why the "Slot
> Implemented" bit is not set. There seems to be nothing useful in the slot
> capabilities, so this may be ok for now. I have reported this to the hardware team.

Thanks.  The "Slot Implemented" bit and the slot registers aren't
essential if you don't want to support hotplug on those slots.  But
even without hotplug, they do contain things like a slot number, which
may be useful in the user interface.
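
For reference, the "Slot+"/"Slot-" annotation discussed above comes from a
single bit in the 16-bit PCI Express Capabilities Register, which also
carries the device/port type (4 for a Root Port, 8 for a PCI/PCI-X to PCIe
bridge). The following is a minimal sketch of that decoding, not lspci's or
the kernel's code. The EXP_* macro names and the function names are local to
this example (the kernel's equivalents are the PCI_EXP_FLAGS_* and
PCI_EXP_TYPE_* definitions), and the raw register values in the usage
function are hand-built for illustration, not taken from the ThunderX2 logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of the PCI Express Capabilities Register (illustrative names). */
#define EXP_FLAGS_VERS  0x000f  /* capability version */
#define EXP_FLAGS_TYPE  0x00f0  /* device/port type */
#define EXP_FLAGS_SLOT  0x0100  /* Slot Implemented */

#define EXP_TYPE_ROOT_PORT    0x4  /* Root Port */
#define EXP_TYPE_PCIE_BRIDGE  0x8  /* PCI/PCI-X to PCIe bridge */

/* Extract the device/port type field. */
unsigned exp_port_type(uint16_t flags)
{
	return (flags & EXP_FLAGS_TYPE) >> 4;
}

/* True if the port claims to be connected to a slot ("Slot+" in lspci). */
bool exp_slot_implemented(uint16_t flags)
{
	return (flags & EXP_FLAGS_SLOT) != 0;
}

/* Usage example with hand-built register values (assumed, not from logs). */
void exp_flags_example(void)
{
	uint16_t root_port_no_slot = 0x0042;  /* v2, Root Port, Slot- */
	uint16_t root_port_slot    = 0x0142;  /* v2, Root Port, Slot+ */
	uint16_t pcie_bridge       = 0x0082;  /* v2, PCI-to-PCIe bridge, Slot- */

	assert(exp_port_type(root_port_no_slot) == EXP_TYPE_ROOT_PORT);
	assert(!exp_slot_implemented(root_port_no_slot));
	assert(exp_slot_implemented(root_port_slot));
	assert(exp_port_type(pcie_bridge) == EXP_TYPE_PCIE_BRIDGE);
	assert(!exp_slot_implemented(pcie_bridge));
}
```

A Root Port reporting 0x0042 would show up as "Root Port (Slot-)", which is
the case questioned above for 12:00.0.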

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6736836..564a84a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3958,6 +3958,20 @@  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
 
 /*
+ * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
+ * associated not at the root bus, but at a bridge below. This quirk flag
+ * will ensure that the aliases are identified correctly.
+ */
+static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
+				quirk_bridge_cavm_thrx2_pcie_root);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
+				quirk_bridge_cavm_thrx2_pcie_root);
+
+/*
  * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
  * class code.  Fix it.
  */