Message ID | 1492115445-4069-3-git-send-email-jnair@caviumnetworks.com (mailing list archive) |
---|---|
State | New, archived |
I tentatively applied both patches to pci/host-thunder for v4.12.

However, I am concerned about the topology here:

On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
> PCI topology is slightly unusual. For a multi-node system, it looks
> like:
>
> 00:00.0 [PCI] bridge to [bus 01-1e]
> 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
> 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
> 03:00.0 PCIe Endpoint

A root port normally has a single PCIe link leading downstream.
According to this, 02:00.0 is a root port that has the usual
downstream link leading to 03:00.0, but it also has an upstream link
to 01:0a.0.

Maybe this example is omitting details that are not relevant to DMA
aliases? The PCIe capability only contains one set of link-related
registers, so I don't know how we could manage a single device that
has two links.

A device with two links would break things like ASPM. In
set_pcie_port_type(), for example, we have this comment:

 * A Root Port or a PCI-to-PCIe bridge is always the upstream end
 * of a Link. No PCIe component has two Links. Two Links are
 * connected by a Switch that has a Port on each Link and internal
 * logic to connect the two Ports.

The topology above breaks these assumptions, which will make
pdev->has_secondary_link incorrect, which means ASPM won't work
correctly.

What am I missing?

Bjorn
On Thu, Apr 13, 2017 at 07:19:11PM -0500, Bjorn Helgaas wrote:
> I tentatively applied both patches to pci/host-thunder for v4.12.
>
> However, I am concerned about the topology here:
>
> On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
> > On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
> > PCI topology is slightly unusual. For a multi-node system, it looks
> > like:
> >
> > 00:00.0 [PCI] bridge to [bus 01-1e]
> > 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
> > 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
> > 03:00.0 PCIe Endpoint
>
> A root port normally has a single PCIe link leading downstream.
> According to this, 02:00.0 is a root port that has the usual
> downstream link leading to 03:00.0, but it also has an upstream link
> to 01:0a.0.

The PCI topology is a bit broken due to the way that the PCIe IP block
was integrated into SoC PCI bridges and devices. The current mechanism
of adding a PCI-PCIe bridge to glue these together is not ideal.

> Maybe this example is omitting details that are not relevant to DMA
> aliases? The PCIe capability only contains one set of link-related
> registers, so I don't know how we could manage a single device that
> has two links.

The root port is standard and has just one link to the EP (or whatever
is on the external PCIe slot). The fallout of the current hw design is
that the PCI-PCIe bridge has a link that does not follow standard and
does not have a counterpart (as you noted).

> A device with two links would break things like ASPM. In
> set_pcie_port_type(), for example, we have this comment:
>
>  * A Root Port or a PCI-to-PCIe bridge is always the upstream end
>  * of a Link. No PCIe component has two Links. Two Links are
>  * connected by a Switch that has a Port on each Link and internal
>  * logic to connect the two Ports.
>
> The topology above breaks these assumptions, which will make
> pdev->has_secondary_link incorrect, which means ASPM won't work
> correctly.

Given the current hardware, the pcieport driver seems to work reasonably
for the root port at 02:00.0, with AER support. I will take a look at the
ASPM part.

Thanks,
JC.
On Fri, Apr 14, 2017 at 4:06 PM, Jayachandran C <jnair@caviumnetworks.com> wrote:
> The PCI topology is a bit broken due to the way that the PCIe IP block
> was integrated into SoC PCI bridges and devices. The current mechanism
> of adding a PCI-PCIe bridge to glue these together is not ideal.

Yeah, that's definitely broken.

> Given the current hardware, the pcieport driver seems to work reasonably
> for the root port at 02:00.0, with AER support. I will take a look at the
> ASPM part.

I don't think pcieport itself cares much about links. ASPM does, but
it looks like set_pcie_port_type() actually is smart enough to know
that PCI-to-PCIe bridges and Root Ports always have links on their
secondary sides. So has_secondary_link probably does get set
correctly.

But I think you'll hit the VIA "strange chipset" thing in
pcie_aspm_init_link_state(), which will probably prevent us from doing
ASPM on the link from 02:00.0 to 03:00.0.

Could you collect "lspci -vv" output from this system? I'd like to
archive that as background for this IOMMU issue and the ASPM tweaks I
suspect we'll have to do. I *wish* we had more information about that
VIA thing, because I suspect we could get rid of it if we had more
details.

Bjorn
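[Editorial sketch] The two checks discussed in the message above can be modeled in a few lines of Python. This is only a simplified model, not kernel code: the class `Dev` and the helpers `set_port_type()`, `aspm_skipped_by_via_check()`, and `build_thread_topology()` are hypothetical names, and the form of the VIA "strange chipset" test (a Root Port that has an upstream bridge is skipped) is an assumption about what pcie_aspm_init_link_state() does. Port types 4 and 8 come from the topology quoted in the thread; 5 and 6 are the Switch Upstream and Downstream Port types from the PCIe spec.

```python
from dataclasses import dataclass
from typing import Optional

# Device/Port Type values from the PCI Express Capabilities Register.
ROOT_PORT = 0x4
UPSTREAM = 0x5
DOWNSTREAM = 0x6
PCIE_BRIDGE = 0x8  # PCI/PCI-X to PCIe bridge ("type 8" in the thread)


@dataclass
class Dev:
    name: str
    pcie_type: Optional[int]        # None models a conventional PCI device
    parent: Optional["Dev"] = None  # upstream bridge, if any
    has_secondary_link: bool = False


def set_port_type(dev: Dev) -> None:
    """Model of the rule in the quoted comment: a Root Port or a
    PCI-to-PCIe bridge is always the upstream end of a Link."""
    if dev.pcie_type in (ROOT_PORT, PCIE_BRIDGE):
        dev.has_secondary_link = True
    elif dev.pcie_type in (UPSTREAM, DOWNSTREAM):
        # A Switch port has a Link on its secondary side only when its
        # parent does not already own the Link above it.
        parent_has_link = dev.parent is not None and dev.parent.has_secondary_link
        dev.has_secondary_link = not parent_has_link


def aspm_skipped_by_via_check(dev: Dev) -> bool:
    """Assumed form of the VIA workaround: skip a Root Port that sits
    below another bridge."""
    return dev.pcie_type == ROOT_PORT and dev.parent is not None


def build_thread_topology():
    """The three bridges from the example topology in the thread."""
    host = Dev("00:00.0", None)
    glue = Dev("01:0a.0", PCIE_BRIDGE, parent=host)
    root = Dev("02:00.0", ROOT_PORT, parent=glue)
    for dev in (host, glue, root):
        set_port_type(dev)
    return host, glue, root
```

Under these assumptions both 01:0a.0 and 02:00.0 end up with has_secondary_link set, which matches the expectation stated above, while 02:00.0 trips the VIA-style check because it is a Root Port with a bridge above it.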
On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
> Could you collect "lspci -vv" output from this system? I'd like to
> archive that as background for this IOMMU issue and the ASPM tweaks I
> suspect we'll have to do. I *wish* we had more information about that
> VIA thing, because I suspect we could get rid of it if we had more
> details.

The full logs are slightly large, so I have kept them at:
https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt

The output is from a 2-socket system, the cards are not on the first slot
like the example above, so the bus and device numbers are different.

Looks like I have to spend some time on ASPM next.

Thanks,
JC.
On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C <jnair@caviumnetworks.com> wrote:
> The full logs are slightly large, so I have kept them at:
> https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
>
> The output is from a 2-socket system, the cards are not on the first slot
> like the example above, so the bus and device numbers are different.
>
> Looks like I have to spend some time on ASPM next.

Thanks, I attached these to https://bugzilla.kernel.org/show_bug.cgi?id=195447
and added that link to the changelogs.

01:0a.0 PCI-to-PCIe bridge to [bus 02-03]
Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge

lspci doesn't decode the "Slot Implemented" bit here. The spec (PCIe
r3.1, sec 7.8.2) isn't explicit about whether that bit is defined for
this kind of bridge, but it seems to me like this bridge contains a
Downstream Port that could lead to a slot, so we *should* decode "Slot
Implemented", and if it does indicate a slot, we should decode the
Slot Capabilities, Control, and Status registers as well.

Linux also doesn't currently believe this bridge can have a slot below
it (see pcie_cap_has_sltctl() and pcie_downstream_port()). I don't
know if your topology has actual slots there, but I think the spec
does allow it, so Linux probably should handle that.

For this port:

02:00.0 Root Port to [bus 03]
Capabilities: [ac] Express (v2) Root Port (Slot-)

I'm pretty sure there *is* a slot (currently empty), and your lspci
output shows "Slot-", which seems wrong to me. It should show "Slot+"
with Presence Detect State showing "Slot Empty", shouldn't it?

Bjorn
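[Editorial sketch] For readers who want to see where "Slot+" and "Slot-" come from, here is a minimal Python decoder for the 16-bit PCI Express Capabilities Register. The field layout (version in bits 3:0, Device/Port Type in bits 7:4, Slot Implemented in bit 8, Interrupt Message Number in bits 13:9) follows the PCIe spec. The helper names `decode_pcie_flags()` and `summarize()` are hypothetical, the output string only approximates lspci's format, and the register values in the usage note are illustrative rather than taken from the ThunderX2 logs.

```python
# Names for the two port types that appear in this thread.
TYPE_NAMES = {
    0x4: "Root Port",
    0x8: "PCI/PCI-X to PCI-Express Bridge",
}


def decode_pcie_flags(flags: int) -> dict:
    """Split the PCI Express Capabilities Register into its fields."""
    return {
        "version": flags & 0xF,                   # bits 3:0
        "port_type": (flags >> 4) & 0xF,          # bits 7:4
        "slot_implemented": bool(flags & 0x100),  # bit 8
        "msi_number": (flags >> 9) & 0x1F,        # bits 13:9
    }


def summarize(flags: int) -> str:
    """Render the fields roughly the way lspci prints them."""
    f = decode_pcie_flags(flags)
    name = TYPE_NAMES.get(f["port_type"], "type %#x" % f["port_type"])
    slot = "Slot+" if f["slot_implemented"] else "Slot-"
    return "Express (v%d) %s (%s), MSI %02x" % (
        f["version"], name, slot, f["msi_number"])
```

With a hypothetical register value of 0x0042, summarize() gives "Express (v2) Root Port (Slot-), MSI 00"; setting bit 8 (0x0142) flips that to "Slot+". The same bit can be decoded for a type 8 bridge, which is the change to lspci suggested above.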
Hi Bjorn, JC,

On 04/13/2017 08:19 PM, Bjorn Helgaas wrote:
> I tentatively applied both patches to pci/host-thunder for v4.12.

Thanks for that :)

> However, I am concerned about the topology here:

Various feedback has been provided on this one over the past $time. In
addition, I have requested that this serve as an example of why we need
a more complete PCIe integration guide for ARMv8. It's on the list of
things for my intended opus magnum on the topic ;)

> On Thu, Apr 13, 2017 at 08:30:45PM +0000, Jayachandran C wrote:
>> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
>> PCI topology is slightly unusual. For a multi-node system, it looks
>> like:
>>
>> 00:00.0 [PCI] bridge to [bus 01-1e]
>> 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
>> 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
>> 03:00.0 PCIe Endpoint
>
> A root port normally has a single PCIe link leading downstream.

In integration terms, there's always a bus on the other side of the RC.
It's just usually a processor local bus of some kind on a proprietary
interconnect, but there's always something there. The issue is when you
can see it as PCIe and it's not through a transparent glue bridge.

I had originally hoped that the ECAM could be hacked up so that we would
first walk the topology at the 02:00.0 as a root and not see what's
above it BUT there are other device attachments up there for the on-SoC
devices and I think we really intend to use those.

Bottom line in my opinion is document this case, use it as a learning
example, and move forward. This has been useful in justifying why we
need better integration documentation from the server community. And in
particular from the OS vendors, of which I guess we can allude to there
being more than Linux interested in ARM server these days ;)

Jon.
One additional footnote. I spent a bunch of time recently on my personal
Xeon systems walking through the PCIe topology and aligning on how to
advise the ARM server community to proceed going forward. If you look at
how Intel vs AMD handle their host bridges for example, you'll see two
very different approaches to the case of cross-socket PCIe. But my
operating assumption is that anything longer term which looks boring and
x86 enough is probably fine from an ARM server point of view.

On 04/19/2017 07:38 PM, Jon Masters wrote:
> Bottom line in my opinion is document this case, use it as a learning
> example, and move forward.
On Wed, Apr 19, 2017 at 7:25 PM, Jon Masters <jcm@redhat.com> wrote:
> One additional footnote. I spent a bunch of time recently on my personal
> Xeon systems walking through the PCIe topology and aligning on how to
> advise the ARM server community to proceed going forward. If you look at
> how Intel vs AMD handle their host bridges for example, you'll see two
> very different approaches to the case of cross-socket PCIe.

As a learning opportunity for me, can you share "lspci -vv" examples
that show this Intel vs AMD difference? Maybe the ACPI host bridge
descriptions from dmesg are relevant too?

> But my
> operating assumption is that anything longer term which looks boring and
> x86 enough is probably fine from an ARM server point of view.

That sounds pretty safe to me.

Bjorn
On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C <jnair@caviumnetworks.com> wrote:
> The full logs are slightly large, so I have kept them at:
> https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
>
> The output is from a 2-socket system, the cards are not on the first slot
> like the example above, so the bus and device numbers are different.

Can somebody with this system collect the "lspci -xxxx" output as well?

I'm making some lspci changes to handle the PCI-to-PCIe bridge
correctly, and I can use the "lspci -xxxx" output to create an lspci
test case.
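[Editorial sketch] To illustrate why a raw "lspci -xxxx" dump is enough to build a test case, the following Python sketch parses the hex lines for a single device back into config-space bytes and walks the capability list to find the PCI Express capability. The helper names `parse_hexdump()`, `find_capability()`, and `pcie_flags()` are hypothetical. The layout it relies on is standard PCI: Status register at offset 0x06 with the Capabilities List bit (0x10), Capabilities Pointer at 0x34, and capability ID 0x10 for PCI Express. It assumes the dump lines have the usual "OFFSET: b0 b1 ..." shape and it handles only one device per input.

```python
from typing import Optional

PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10
PCI_CAPABILITY_LIST = 0x34
PCI_CAP_ID_EXP = 0x10


def parse_hexdump(text: str) -> bytearray:
    """Rebuild config space from 'OFFSET: b0 b1 ...' lines of one device."""
    cfg = bytearray()
    for line in text.splitlines():
        head, sep, rest = line.partition(": ")
        if not sep:
            continue
        try:
            offset = int(head, 16)
            data = bytes.fromhex(rest)
        except ValueError:
            continue  # not a hex line, e.g. the "02:00.0 PCI bridge: ..." header
        end = offset + len(data)
        if len(cfg) < end:
            cfg.extend(bytes(end - len(cfg)))
        cfg[offset:end] = data
    return cfg


def find_capability(cfg: bytearray, cap_id: int) -> Optional[int]:
    """Walk the capability list and return the offset of cap_id, if present."""
    if len(cfg) <= PCI_CAPABILITY_LIST:
        return None
    if not cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST:
        return None
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    while ptr and ptr not in seen and ptr + 1 < len(cfg):
        seen.add(ptr)  # guard against a looping list
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def pcie_flags(cfg: bytearray) -> Optional[int]:
    """Return the 16-bit PCI Express Capabilities Register, or None."""
    off = find_capability(cfg, PCI_CAP_ID_EXP)
    if off is None or off + 3 >= len(cfg):
        return None
    return cfg[off + 2] | (cfg[off + 3] << 8)
```

The value returned by pcie_flags() carries the Device/Port Type and the "Slot Implemented" bit, so a saved dump lets a decoder be checked without access to the hardware.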
On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> <jnair@caviumnetworks.com> wrote:
> > The full logs are slightly large, so I have kept them at:
> > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
>
> Can somebody with this system collect the "lspci -xxxx" output as well?
>
> I'm making some lspci changes to handle the PCI-to-PCIe bridge
> correctly, and I can use the "lspci -xxxx" output to create an lspci
> test case.

[Sorry was AFK for a few days]

I have updated the above directory with the log. Also tested your next branch
and it works fine on ThunderX2.

JC.
On Fri, Apr 21, 2017 at 05:05:41PM +0000, Jayachandran C wrote:
> On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> > <jnair@caviumnetworks.com> wrote:
> > > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
> > >> Could you collect "lspci -vv" output from this system? I'd like to
> > >> archive that as background for this IOMMU issue and the ASPM tweaks I
> > >> suspect we'll have to do. I *wish* we had more information about that
> > >> VIA thing, because I suspect we could get rid of it if we had more
> > >> details.
> > >
> > > The full logs are slightly large, so I have kept them at:
> > > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
> > >
> > > The output is from a 2-socket system, the cards are not on the first slot
> > > like the example above, so the bus and device numbers are different.
> >
> > Can somebody with this system collect the "lspci -xxxx" output as well?
> >
> > I'm making some lspci changes to handle the PCI-to-PCIe bridge
> > correctly, and I can use the "lspci -xxxx" output to create an lspci
> > test case.
>
> [Sorry was AFK for a few days]
>
> I have updated the above directory with the log. Also tested your next branch
> and it works fine on ThunderX2.

Thanks!

With regard to my lspci changes, they add "Slot-" here:

  01:0a.0 PCI bridge: Broadcom Corporation Device 9039
  ...
  - Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge, MSI 00
  + Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge (Slot-), MSI 00

for all your PCI-to-PCIe bridges. I assume the "Slot-" is correct, i.e.,
the link is not connected to a slot, right? This comes from the "Slot
Implemented" bit in the PCIe Capabilities Register.

I did notice that all the Root Port devices claim to *not* be connected to
slots, which doesn't seem right. For example,

  12:00.0 PCI bridge: Broadcom Corporation Device 9084
  Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
  Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00

  13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection

It seems strange because the 12:00.0 Root Port looks like it probably
*does* lead to a slot where the NIC is plugged in. Or is that NIC really
soldered down?

But I assume there are *some* PCIe slots, so at least some of those Root
Ports should advertise "Slot+" (which by itself does not imply hotplug
support, if that's the concern).

Bjorn
On Fri, Apr 21, 2017 at 12:57:05PM -0500, Bjorn Helgaas wrote: > On Fri, Apr 21, 2017 at 05:05:41PM +0000, Jayachandran C wrote: > > On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote: > > > On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C > > > <jnair@caviumnetworks.com> wrote: > > > > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote: > > > > >> Could you collect "lspci -vv" output from this system? I'd like to > > > >> archive that as background for this IOMMU issue and the ASPM tweaks I > > > >> suspect we'll have to do. I *wish* we had more information about that > > > >> VIA thing, because I suspect we could get rid of it if we had more > > > >> details. > > > > > > > > The full logs are slightly large, so I have kept them at: > > > > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/ > > > > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt > > > > > > > > The output is from 2 socket system, the cards are not on the first slot > > > > like the example above, so the bus and device numbers are different. > > > > > > Can somebody with this system collect the "lspci -xxxx" output as well? > > > > > > I'm making some lspci changes to handle the PCI-to-PCIe bridge > > > correctly, and I can use the "lspci -xxxx" output to create an lspci > > > test case. > > > > [Sorry was AFK for a few days] > > > > I have updated the above directory with the log. Also tested your next branch > > and it works fine on ThunderX2. > > Thanks! > > With regard to my lspci changes, they add "Slot-" here: > > 01:0a.0 PCI bridge: Broadcom Corporation Device 9039 > ... > - Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge, MSI 00 > + Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge (Slot-), MSI 00 > > for all your PCI-to-PCIe bridges. I assume the "Slot-" is correct, i.e., > the link is not connected to a slot, right? This comes from the "Slot > Implemented" bit in the PCIe Capabilities Register. 
Yes, Slot- should be correct.

> I did notice that all the Root Port devices claim to *not* be connected to
> slots, which doesn't seem right. For example,
>
>   12:00.0 PCI bridge: Broadcom Corporation Device 9084
>     Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
>     Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
>
>   13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
>
> It seems strange because the 12:00.0 Root Port looks like it probably
> *does* lead to a slot where the NIC is plugged in. Or is that NIC really
> soldered down?
>
> But I assume there are *some* PCIe slots, so at least some of those Root
> Ports should advertise "Slot+" (which by itself does not imply hotplug
> support, if that's the concern).

The Root Ports are connected to a slot, so I am not sure why the Slot
Implemented bit is not set. There seems to be nothing useful in the slot
capabilities, so this may be ok for now. I have reported this to the
hardware team.

Thanks for the review and the comments,
JC.
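For readers following the "Slot-" / "Slot+" discussion, here is a minimal sketch of how the two fields mentioned above could be pulled out of a raw 16-bit PCI Express Capabilities Register value. The macro and function names are made up for this illustration; the bit positions (Device/Port Type in bits 7:4, Slot Implemented in bit 8) and the type encodings 4 and 8 follow the register layout as I understand it, and this is not the lspci code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field masks of the 16-bit PCI Express Capabilities Register.
 * Names are illustrative, not taken from lspci or the kernel. */
#define EXP_FLAGS_TYPE  0x00f0u  /* Device/Port Type, bits 7:4 */
#define EXP_FLAGS_SLOT  0x0100u  /* Slot Implemented, bit 8 */

/* Device/Port Type encodings that appear in this thread. */
#define EXP_TYPE_ROOT_PORT    0x4u  /* Root Port ("type 4") */
#define EXP_TYPE_PCIE_BRIDGE  0x8u  /* PCI/PCI-X to PCIe bridge ("type 8") */

/* Extract the Device/Port Type field from the register value. */
static unsigned int exp_port_type(uint16_t flags)
{
	return (flags & EXP_FLAGS_TYPE) >> 4;
}

/* True when the port reports "Slot+", false for "Slot-". */
static bool exp_slot_implemented(uint16_t flags)
{
	return (flags & EXP_FLAGS_SLOT) != 0;
}
```

With these helpers, a value such as 0x0042 (capability version 2, type 4, bit 8 clear) would decode as a Root Port reporting "Slot-", which is the combination questioned above.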
On Tue, Apr 25, 2017 at 8:03 AM, Jayachandran C
<jnair@caviumnetworks.com> wrote:
> On Fri, Apr 21, 2017 at 12:57:05PM -0500, Bjorn Helgaas wrote:
>> I did notice that all the Root Port devices claim to *not* be connected to
>> slots, which doesn't seem right. For example,
>>
>>   12:00.0 PCI bridge: Broadcom Corporation Device 9084
>>     Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
>>     Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
>>
>>   13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
>>
>> It seems strange because the 12:00.0 Root Port looks like it probably
>> *does* lead to a slot where the NIC is plugged in. Or is that NIC really
>> soldered down?
>>
>> But I assume there are *some* PCIe slots, so at least some of those Root
>> Ports should advertise "Slot+" (which by itself does not imply hotplug
>> support, if that's the concern).
>
> The Root Ports are connected to a slot, so I am not sure why the Slot
> Implemented bit is not set. There seems to be nothing useful in the slot
> capabilities, so this may be ok for now. I have reported this to the
> hardware team.

Thanks. The "Slot Implemented" bit and the slot registers aren't
essential if you don't want to support hotplug on those slots. But even
without hotplug, they do contain things like a slot number, which may be
useful in the user interface.
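To make the "slot number" remark concrete, the following is a small sketch of extracting the Physical Slot Number from a raw 32-bit Slot Capabilities register value. The names are hypothetical; the field position (bits 31:19) is taken from the PCIe register layout as I understand it, and the register is only meaningful when the port reports Slot Implemented.

```c
#include <stdint.h>

/* Physical Slot Number field of the 32-bit Slot Capabilities register,
 * bits 31:19. The macro name is illustrative. */
#define EXP_SLTCAP_PSN  0xfff80000u

/* Return the Physical Slot Number encoded in a Slot Capabilities value. */
static unsigned int exp_physical_slot_number(uint32_t sltcap)
{
	return (sltcap & EXP_SLTCAP_PSN) >> 19;
}
```

A user interface could show this number next to the bus address so that a card can be matched to a labelled slot on the chassis, which is the kind of use hinted at above.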
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6736836..564a84a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
 
 /*
+ * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
+ * associated not at the root bus, but at a bridge below. This quirk flag
+ * will ensure that the aliases are identified correctly.
+ */
+static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
+				quirk_bridge_cavm_thrx2_pcie_root);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
+				quirk_bridge_cavm_thrx2_pcie_root);
+
+/*
  * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
  * class code. Fix it.
  */
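The quirk above only sets a flag; the code that consumes it is not part of this message. As a rough userspace model of the intended effect, the sketch below assumes that an upstream DMA-alias walk stops as soon as it reaches a bridge carrying the flag, so bridges above the translation point are never considered. Everything here (struct model_dev, MODEL_XLATE_ROOT, both functions) is hypothetical and heavily reduced; it is not the kernel's alias-walking code.

```c
#include <stddef.h>

/* Stand-in for PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; the value is arbitrary. */
#define MODEL_XLATE_ROOT 0x1u

/* Hypothetical, heavily reduced stand-in for struct pci_dev. */
struct model_dev {
	unsigned int dev_flags;
	struct model_dev *upstream;  /* bridge above, NULL above the host bridge */
};

/* Count the upstream bridges an alias walk starting at dev would examine.
 * Assumption: the walk stops at the first bridge flagged XLATE_ROOT and
 * does not examine that bridge or anything above it. */
static unsigned int bridges_examined(const struct model_dev *dev)
{
	unsigned int n = 0;
	const struct model_dev *br;

	for (br = dev->upstream; br != NULL; br = br->upstream) {
		if (br->dev_flags & MODEL_XLATE_ROOT)
			break;
		n++;
	}
	return n;
}

/* Build the chain from the patch description and walk up from the endpoint:
 * 00:00.0 -> 01:0a.0 -> 02:00.0 (Root Port) -> 03:00.0 (endpoint). */
static unsigned int demo_walk(int root_port_flagged)
{
	struct model_dev host = { 0, NULL };
	struct model_dev glue = { 0, &host };
	struct model_dev root_port = { root_port_flagged ? MODEL_XLATE_ROOT : 0, &glue };
	struct model_dev endpoint = { 0, &root_port };

	return bridges_examined(&endpoint);
}
```

In this model, demo_walk(0) examines all three bridges above the endpoint, while demo_walk(1) examines none, mirroring the comment in the patch that the IOMMU is associated at a bridge below the root bus.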