Message ID | 1491225304-3559-3-git-send-email-jnair@caviumnetworks.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On 03/04/17 14:15, Jayachandran C wrote: > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > topology is slightly unusual. For a multi-node system, it looks like: > > [node level PCI bridges - one per node] > [SoC PCI devices with MSI-X but no IOMMU] > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > [PCIe real root ports associated with IOMMU and GICv3 ITS] > [External PCI devices connected to PCIe links] Since it's not entirely obvious, what does the actual DT - or IORT if you must ;) - topology for this look like? I can't help thinking that either it's inaccurate, or that this is going to expose a shortcoming in pci_dma_configure() which breaks things - unless I'm missing something, isn't find_pci_root_bus() going to go all the way up to the top-level glue bridge and pick up the wrong firmware node (if any) for the appropriate DMA properties? Robin. > The top two levels of bridges should have introduced aliases since they > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > In the case of external PCIe devices, the "real" root ports are connected > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > node level bridges do not introduce an alias either. > > To handle this quirk, we mark the real PCIe root ports and node level > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > pci_for_each_dma_alias() works correctly for external PCIe devices and > SoC PCI devices. > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > are from Broadcom Vulcan (14e4:90XX). > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> > --- > drivers/pci/quirks.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 6736836..564a84a 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); > > /* > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are > + * associated not at the root bus, but at a bridge below. This quirk flag > + * will ensure that the aliases are identified correctly. > + */ > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) > +{ > + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; > +} > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, > + quirk_bridge_cavm_thrx2_pcie_root); > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, > + quirk_bridge_cavm_thrx2_pcie_root); > + > +/* > * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) > * class code. Fix it. > */ >
On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote: > On 03/04/17 14:15, Jayachandran C wrote: > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > topology is slightly unusual. For a multi-node system, it looks like: > > > > [node level PCI bridges - one per node] > > [SoC PCI devices with MSI-X but no IOMMU] > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > [External PCI devices connected to PCIe links] > > Since it's not entirely obvious, what does the actual DT - or IORT if > you must ;) - topology for this look like? I can't help thinking that > either it's inaccurate, or that this is going to expose a shortcoming in > pci_dma_configure() which breaks things - unless I'm missing something, > isn't find_pci_root_bus() going to go all the way up to the top-level > glue bridge and pick up the wrong firmware node (if any) for the > appropriate DMA properties? I will try to describe the ACPI interface: There is just one ECAM area, a single bus range and one set of memory windows for the whole system - so there is just one entry in DSDT for the PCI controller. This entry also corresponds to the PCI RC node in IORT. DMA is coherent and supports 64 bits system-wide, the attributes (in DSDT and IORT) reflect this. lspci on the system looks like this: -[0000:00]-+-00.0-[01-1e]--+-04.0 14e4:9026 | +-04.1 14e4:9026 | +-05.0 14e4:9027 | +-05.1 14e4:9027 | +-0a.0-[02-03]----00.0-[03]-- | +-0a.1-[04-05]----00.0-[05]-- | [...etc...] | +-0b.0-[12-14]----00.0-[13-14]--+-00.0 8086:1583 | | \-00.1 8086:1583 | [...etc...] | \-0b.5-[1d-1e]----00.0-[1e]-- \-00.1-[1f-3b]--+-04.0 14e4:9026 +-04.1 14e4:9026 +-05.0 14e4:9027 +-05.1 14e4:9027 +-0a.0-[20-21]----00.0-[21]-- [...etc...] The devices here are: - 00:00.0 and 00:00.1 are the node (socket) level bridges - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU. The IORT is built by the firmware based on its PCI enumeration. The IORT will have multiple entries under the PCI RC node: - one entry per node to map the SoC devices directly to ITS for MSI-X, since the SoC devices are not attached to any SMMU. - An entry per "real" PCIe port to map RIDs under it to the corresponding SMMU. The SMMU nodes will have an entry to map its RID ranges to the node ITS. The IORT spec supports this configuration, and the corresponding code is already upstream, so the only sticking point right now is pci_for_each_dma_alias(). JC.
On 04/04/17 12:50, Jayachandran C wrote: > On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote: >> On 03/04/17 14:15, Jayachandran C wrote: >>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI >>> topology is slightly unusual. For a multi-node system, it looks like: >>> >>> [node level PCI bridges - one per node] >>> [SoC PCI devices with MSI-X but no IOMMU] >>> [PCI-PCIe "glue" bridges - upto 14, one per real port below] >>> [PCIe real root ports associated with IOMMU and GICv3 ITS] >>> [External PCI devices connected to PCIe links] >> >> Since it's not entirely obvious, what does the actual DT - or IORT if >> you must ;) - topology for this look like? I can't help thinking that >> either it's inaccurate, or that this is going to expose a shortcoming in >> pci_dma_configure() which breaks things - unless I'm missing something, >> isn't find_pci_root_bus() going to go all the way up to the top-level >> glue bridge and pick up the wrong firmware node (if any) for the >> appropriate DMA properties? > > I will try to describe the ACPI interface: > > There is just one ECAM area, a single bus range and one set of memory > windows for the whole system - so there is just one entry in DSDT for > the PCI controller. This entry also corresponds to the PCI RC node in > IORT. DMA is coherent and supports 64 bits system-wide, the attributes > (in DSDT and IORT) reflect this. > > lspci on the system looks like this: > -[0000:00]-+-00.0-[01-1e]--+-04.0 14e4:9026 > | +-04.1 14e4:9026 > | +-05.0 14e4:9027 > | +-05.1 14e4:9027 > | +-0a.0-[02-03]----00.0-[03]-- > | +-0a.1-[04-05]----00.0-[05]-- > | [...etc...] > | +-0b.0-[12-14]----00.0-[13-14]--+-00.0 8086:1583 > | | \-00.1 8086:1583 > | [...etc...] > | \-0b.5-[1d-1e]----00.0-[1e]-- > \-00.1-[1f-3b]--+-04.0 14e4:9026 > +-04.1 14e4:9026 > +-05.0 14e4:9027 > +-05.1 14e4:9027 > +-0a.0-[20-21]----00.0-[21]-- > [...etc...] > > The devices here are: > - 00:00.0 and 00:00.1 are the node (socket) level bridges > - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB > - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges > - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. > Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU. > > The IORT is built by the firmware based on its PCI enumeration. The IORT > will have multiple entries under the PCI RC node: > - one entry per node to map the SoC devices directly to ITS for MSI-X, > since the SoC devices are not attached to any SMMU. > - An entry per "real" PCIe port to map RIDs under it to the corresponding > SMMU. > The SMMU nodes will have an entry to map its RID ranges to the node ITS. > > The IORT spec supports this configuration, and the corresponding code is > already upstream, so the only sticking point right now is > pci_for_each_dma_alias(). Thanks, that helps a lot. The "single global ECAM space" idea was eluding me, but in that context it all makes much more sense - I'm assuming the two quirked device IDs correspond to the 00:00.[01] devices and the [02-1e]:00.0 ones. So (at the risk of Jon mooing at me), I guess the DT description would be a single node looking something like: pcie { reg = [global ECAM space for segment 0000]; msi-map = <0x0100 &its0 0x0100 0x1d00>, <0x1f00 &its1 0x1f00 0x1d00>; iommu-map = <0x0200 &smmu0 0x0200 0x1c00>, <0x2000 &smmu0 0x2000 0x1c00>; }; (note to self: which incidentally also means of_pci_map_rid() probably wants fixing to not treat gaps in the map as an error) With only one node like that, rather than having the whole first 3 levels of bridges described, the "stop at the appropriate node in the callback" approach does become even more impractical in all cases. So, for $TITLE, based on the above understanding: Reviewed-by: Robin Murphy <robin.murphy@arm.com> Cheers, Robin. > > JC. >
[Moving Bjorn back to to: ] On Tue, Apr 04, 2017 at 03:28:26PM +0100, Robin Murphy wrote: > On 04/04/17 12:50, Jayachandran C wrote: > > On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote: > >> On 03/04/17 14:15, Jayachandran C wrote: > >>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > >>> topology is slightly unusual. For a multi-node system, it looks like: > >>> > >>> [node level PCI bridges - one per node] > >>> [SoC PCI devices with MSI-X but no IOMMU] > >>> [PCI-PCIe "glue" bridges - upto 14, one per real port below] > >>> [PCIe real root ports associated with IOMMU and GICv3 ITS] > >>> [External PCI devices connected to PCIe links] > >> > >> Since it's not entirely obvious, what does the actual DT - or IORT if > >> you must ;) - topology for this look like? I can't help thinking that > >> either it's inaccurate, or that this is going to expose a shortcoming in > >> pci_dma_configure() which breaks things - unless I'm missing something, > >> isn't find_pci_root_bus() going to go all the way up to the top-level > >> glue bridge and pick up the wrong firmware node (if any) for the > >> appropriate DMA properties? > > > > I will try to describe the ACPI interface: > > > > There is just one ECAM area, a single bus range and one set of memory > > windows for the whole system - so there is just one entry in DSDT for > > the PCI controller. This entry also corresponds to the PCI RC node in > > IORT. DMA is coherent and supports 64 bits system-wide, the attributes > > (in DSDT and IORT) reflect this. > > > > lspci on the system looks like this: > > -[0000:00]-+-00.0-[01-1e]--+-04.0 14e4:9026 > > | +-04.1 14e4:9026 > > | +-05.0 14e4:9027 > > | +-05.1 14e4:9027 > > | +-0a.0-[02-03]----00.0-[03]-- > > | +-0a.1-[04-05]----00.0-[05]-- > > | [...etc...] > > | +-0b.0-[12-14]----00.0-[13-14]--+-00.0 8086:1583 > > | | \-00.1 8086:1583 > > | [...etc...] > > | \-0b.5-[1d-1e]----00.0-[1e]-- > > \-00.1-[1f-3b]--+-04.0 14e4:9026 > > +-04.1 14e4:9026 > > +-05.0 14e4:9027 > > +-05.1 14e4:9027 > > +-0a.0-[20-21]----00.0-[21]-- > > [...etc...] > > > > The devices here are: > > - 00:00.0 and 00:00.1 are the node (socket) level bridges > > - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB > > - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges > > - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. > > Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU. > > > > The IORT is built by the firmware based on its PCI enumeration. The IORT > > will have multiple entries under the PCI RC node: > > - one entry per node to map the SoC devices directly to ITS for MSI-X, > > since the SoC devices are not attached to any SMMU. > > - An entry per "real" PCIe port to map RIDs under it to the corresponding > > SMMU. > > The SMMU nodes will have an entry to map its RID ranges to the node ITS. > > > > The IORT spec supports this configuration, and the corresponding code is > > already upstream, so the only sticking point right now is > > pci_for_each_dma_alias(). > > Thanks, that helps a lot. The "single global ECAM space" idea was > eluding me, but in that context it all makes much more sense - I'm > assuming the two quirked device IDs correspond to the 00:00.[01] devices > and the [02-1e]:00.0 ones. > > So (at the risk of Jon mooing at me), I guess the DT description would > be a single node looking something like: > > pcie { > reg = [global ECAM space for segment 0000]; > > msi-map = <0x0100 &its0 0x0100 0x1d00>, > <0x1f00 &its1 0x1f00 0x1d00>; > iommu-map = <0x0200 &smmu0 0x0200 0x1c00>, > <0x2000 &smmu0 0x2000 0x1c00>; > }; > > (note to self: which incidentally also means of_pci_map_rid() probably > wants fixing to not treat gaps in the map as an error) > > With only one node like that, rather than having the whole first 3 > levels of bridges described, the "stop at the appropriate node in the > callback" approach does become even more impractical in all cases. So, > for $TITLE, based on the above understanding: > > Reviewed-by: Robin Murphy <robin.murphy@arm.com> Hi Bjorn, This seems to be the reasonable way to add support for the quirk. Would really appreciate feedback from you. Thanks, JC.
Hi Jayachandran, On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > topology is slightly unusual. For a multi-node system, it looks like: > > [node level PCI bridges - one per node] > [SoC PCI devices with MSI-X but no IOMMU] > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > [PCIe real root ports associated with IOMMU and GICv3 ITS] > [External PCI devices connected to PCIe links] > > The top two levels of bridges should have introduced aliases since they > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > In the case of external PCIe devices, the "real" root ports are connected > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > node level bridges do not introduce an alias either. > > To handle this quirk, we mark the real PCIe root ports and node level > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > pci_for_each_dma_alias() works correctly for external PCIe devices and > SoC PCI devices. > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > are from Broadcom Vulcan (14e4:90XX). Can you supply some text here about why we want to apply this patch? E.g., does it avoid making unnecessary IOMMU mappings, improve performance, avoid a crash, etc? > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> > --- > drivers/pci/quirks.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 6736836..564a84a 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); > > /* > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are > + * associated not at the root bus, but at a bridge below. This quirk flag > + * will ensure that the aliases are identified correctly. > + */ > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) > +{ > + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; > +} > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, > + quirk_bridge_cavm_thrx2_pcie_root); > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, > + quirk_bridge_cavm_thrx2_pcie_root); > + > +/* > * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) > * class code. Fix it. > */ > -- > 2.7.4 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > Hi Jayachandran, > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > topology is slightly unusual. For a multi-node system, it looks like: > > > > [node level PCI bridges - one per node] > > [SoC PCI devices with MSI-X but no IOMMU] > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > [External PCI devices connected to PCIe links] > > > > The top two levels of bridges should have introduced aliases since they > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > In the case of external PCIe devices, the "real" root ports are connected > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > node level bridges do not introduce an alias either. > > > > To handle this quirk, we mark the real PCIe root ports and node level > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > SoC PCI devices. > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > are from Broadcom Vulcan (14e4:90XX). > > Can you supply some text here about why we want to apply this patch? > E.g., does it avoid making unnecessary IOMMU mappings, improve > performance, avoid a crash, etc? If this is for the commit message, I hope the following is ok: "With this change, both MSI-X and IO virtualization work correctly on Cavium ThunderX2. The GIC ITS driver gets the correct device ID to configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI devices, and the IOMMU groups are setup correctly." I can send out a new patch if needed. The on chip SATA and USB use MSI-X, so this is needed for basic functionality of the platform. > > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> > > --- > > drivers/pci/quirks.c | 14 ++++++++++++++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index 6736836..564a84a 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); > > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); > > > > /* > > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are > > + * associated not at the root bus, but at a bridge below. This quirk flag > > + * will ensure that the aliases are identified correctly. > > + */ > > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) > > +{ > > + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; > > +} > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, > > + quirk_bridge_cavm_thrx2_pcie_root); > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, > > + quirk_bridge_cavm_thrx2_pcie_root); > > + > > +/* > > * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) > > * class code. Fix it. > > */ Thanks, JC.
[+cc Joerg] On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > Hi Jayachandran, > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > [node level PCI bridges - one per node] > > > [SoC PCI devices with MSI-X but no IOMMU] > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > [External PCI devices connected to PCIe links] > > > > > > The top two levels of bridges should have introduced aliases since they > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > In the case of external PCIe devices, the "real" root ports are connected > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > node level bridges do not introduce an alias either. > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > SoC PCI devices. > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > are from Broadcom Vulcan (14e4:90XX). > > > > Can you supply some text here about why we want to apply this patch? > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > performance, avoid a crash, etc? > > If this is for the commit message, I hope the following is ok: > > "With this change, both MSI-X and IO virtualization work correctly on > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > devices, and the IOMMU groups are setup correctly." This doesn't get at what the actual problem is. I'm hoping for something like "without this change, we set up an IOMMU mapping for requestor ID X, but device DMA uses requestor ID Y because ...., which results in an IOMMU fault" I've been puzzling over the fact that most of the callers of pci_for_each_dma_alias() don't seem to use it correctly. For Intel IOMMUs, domain_context_mapping() uses it to add a mapping for every possible alias. But most of the other callers only look at the last alias and ignore all the others. That might work most of the time, but: - There's no guarantee that pci_for_each_dma_alias() iterates in any particular order, so relying on the current order is fragile, - The pci_add_dma_alias() interface allows an arbitrary number of aliases (as long as they're all on the same bus), and some devices do use more than one, e.g., quirk_dma_func0_alias(), quirk_mic_x200_dma_alias(), - pci_for_each_dma_alias() translates the rules in the PCIe to PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into aliases. I think it's important to pay attention to *every* possible alias, not just the last one. I suspect the reason this patch makes a difference is because the current pci_for_each_dma_alias() believes one of those top-level bridges is an alias, and the iterator produces it last, so that's the one you map. The IOMMU is attached lower down, so that top-level bridge is not in fact an alias, but since you only look at the *last* one, you don't map the correct aliases from lower down in the tree. Stopping the iterator earlier happens to make the last alias be one of the correct ones, but it doesn't solve the problems of quirked devices that can use multiple requester IDs, and it doesn't solve the problem of PCIe-to-PCI bridges that optionally take ownership of transactions. > I can send out a new patch if needed. > > The on chip SATA and USB use MSI-X, so this is needed for basic > functionality of the platform. No need for a new patch; I can integrate something into the changelog. > > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> > > > --- > > > drivers/pci/quirks.c | 14 ++++++++++++++ > > > 1 file changed, 14 insertions(+) > > > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > > index 6736836..564a84a 100644 > > > --- a/drivers/pci/quirks.c > > > +++ b/drivers/pci/quirks.c > > > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); > > > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); > > > > > > /* > > > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are > > > + * associated not at the root bus, but at a bridge below. This quirk flag > > > + * will ensure that the aliases are identified correctly. > > > + */ > > > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) > > > +{ > > > + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; > > > +} > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, > > > + quirk_bridge_cavm_thrx2_pcie_root); > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, > > > + quirk_bridge_cavm_thrx2_pcie_root); > > > + > > > +/* > > > * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) > > > * class code. Fix it. > > > */ > > Thanks, > JC. > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > [+cc Joerg] > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > Hi Jayachandran, > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > [node level PCI bridges - one per node] > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > [External PCI devices connected to PCIe links] > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > node level bridges do not introduce an alias either. > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > SoC PCI devices. > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > Can you supply some text here about why we want to apply this patch? > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > performance, avoid a crash, etc? > > > > If this is for the commit message, I hope the following is ok: > > > > "With this change, both MSI-X and IO virtualization work correctly on > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > devices, and the IOMMU groups are setup correctly." > > This doesn't get at what the actual problem is. I'm hoping for > something like "without this change, we set up an IOMMU mapping for > requestor ID X, but device DMA uses requestor ID Y because ...., which > results in an IOMMU fault" Ok. I hope this would be better: "Without this change, the last alias seen while traversing the PCI hierarchy will be used as the RID to generate the device ID for ITS and stream ID for SMMU. This in turn causes the MSI-X generated by the device to fail since the ITS expects to have translation tables based on the actual PCIe RID and not the (irrelevant) alias. Similarly, the device DMA also fails when SMMU is enabled due to incorrect value in SMMU translation tables" > I've been puzzling over the fact that most of the callers of > pci_for_each_dma_alias() don't seem to use it correctly. For Intel > IOMMUs, domain_context_mapping() uses it to add a mapping for every > possible alias. But most of the other callers only look at the last > alias and ignore all the others. That might work most of the time, > but: > > - There's no guarantee that pci_for_each_dma_alias() iterates in any > particular order, so relying on the current order is fragile, > > - The pci_add_dma_alias() interface allows an arbitrary number of > aliases (as long as they're all on the same bus), and some devices > do use more than one, e.g., quirk_dma_func0_alias(), > quirk_mic_x200_dma_alias(), > > - pci_for_each_dma_alias() translates the rules in the PCIe to > PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into > aliases. I think it's important to pay attention to *every* > possible alias, not just the last one. pci_for_each_dma_alias() is used by the ARM code to find the RID (Requester ID), and this is taken as the last alias as seen from the PCI controller (RC). The RID is then used to program the Device ID of the GIC ITS (ARM generic interrupt controller's interrupt translation service) for MSI-X (and similarly to program Stream ID of the SMMU). The translation from RID to Device ID or stream ID is provided by the IORT ACPI table[1] or by the a {iommu,msi}-{map,mask} [2] property in the device tree. Taking the last alias maybe reasonable since the mapping is from (PCI RC, RID) to (SMMU, streamID) or (GIC ITS, deviceID) and we are looking for a single the RID for a device as seen from the controller. > I suspect the reason this patch makes a difference is because the > current pci_for_each_dma_alias() believes one of those top-level > bridges is an alias, and the iterator produces it last, so that's the > one you map. The IOMMU is attached lower down, so that top-level > bridge is not in fact an alias, but since you only look at the *last* > one, you don't map the correct aliases from lower down in the tree. Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which means that a PCI RC can multiple SMMUs, each handling a subset of RIDs. In the case of Cavium ThunderX2, the RID which we should see on the RC - if we follow the standard and factor in the aliasing introduced by the PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or ITS). But, if we stop the traversal at the point where SMMU (or ITS) is attached, we will get the correct RID as seen by these. > Stopping the iterator earlier happens to make the last alias be one of > the correct ones, but it doesn't solve the problems of quirked devices > that can use multiple requester IDs, and it doesn't solve the problem > of PCIe-to-PCI bridges that optionally take ownership of transactions. If these happen below the point where the SMMU is attached, we will consider the last alias introduced, which should be ok. If they are above, the alias introduced is not relevant. Devices with multiple aliases is not handled anywhere in ARM code, so I don't think we should consider that here. > > I can send out a new patch if needed. > > > > The on chip SATA and USB use MSI-X, so this is needed for basic > > functionality of the platform. > > No need for a new patch; I can integrate something into the changelog. > > > > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> > > > > --- > > > > drivers/pci/quirks.c | 14 ++++++++++++++ > > > > 1 file changed, 14 insertions(+) > > > > > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > > > index 6736836..564a84a 100644 > > > > --- a/drivers/pci/quirks.c > > > > +++ b/drivers/pci/quirks.c > > > > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); > > > > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); > > > > > > > > /* > > > > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are > > > > + * associated not at the root bus, but at a bridge below. This quirk flag > > > > + * will ensure that the aliases are identified correctly. > > > > + */ > > > > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) > > > > +{ > > > > + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; > > > > +} > > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, > > > > + quirk_bridge_cavm_thrx2_pcie_root); > > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, > > > > + quirk_bridge_cavm_thrx2_pcie_root); > > > > + > > > > +/* > > > > * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) > > > > * class code. Fix it. > > > > */ Thanks, JC. [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0049b/DEN0049B_IO_Remapping_Table.pdf [2] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt
On 11/04/17 14:41, Bjorn Helgaas wrote: > [+cc Joerg] > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: >> On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: >>> Hi Jayachandran, >>> >>> On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: >>>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI >>>> topology is slightly unusual. For a multi-node system, it looks like: >>>> >>>> [node level PCI bridges - one per node] >>>> [SoC PCI devices with MSI-X but no IOMMU] >>>> [PCI-PCIe "glue" bridges - upto 14, one per real port below] >>>> [PCIe real root ports associated with IOMMU and GICv3 ITS] >>>> [External PCI devices connected to PCIe links] >>>> >>>> The top two levels of bridges should have introduced aliases since they >>>> are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. >>>> In the case of external PCIe devices, the "real" root ports are connected >>>> to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an >>>> alias. The SoC PCI devices are directly connected to the GIC ITS, so the >>>> node level bridges do not introduce an alias either. >>>> >>>> To handle this quirk, we mark the real PCIe root ports and node level >>>> PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, >>>> pci_for_each_dma_alias() works correctly for external PCIe devices and >>>> SoC PCI devices. >>>> >>>> For the current revision of Cavium ThunderX2, the VendorID and Device ID >>>> are from Broadcom Vulcan (14e4:90XX). >>> >>> Can you supply some text here about why we want to apply this patch? >>> E.g., does it avoid making unnecessary IOMMU mappings, improve >>> performance, avoid a crash, etc? >> >> If this is for the commit message, I hope the following is ok: >> >> "With this change, both MSI-X and IO virtualization work correctly on >> Cavium ThunderX2. The GIC ITS driver gets the correct device ID to >> configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI >> devices, and the IOMMU groups are setup correctly." > > This doesn't get at what the actual problem is. I'm hoping for > something like "without this change, we set up an IOMMU mapping for > requestor ID X, but device DMA uses requestor ID Y because ...., which > results in an IOMMU fault" The fact that certain bits of code only consider the last-seen alias RID does mean we will install all mappings for a RID the IOMMU will never see, causing subsequent attempt to access them (via the real RID) to fault. Even with that fixed, considering the past-the-IOMMU-connection bridges will cause IOMMU group assignment to believe the IOMMU can't distinguish them and thus prevent individual devices being assignable to different VMs. > I've been puzzling over the fact that most of the callers of > pci_for_each_dma_alias() don't seem to use it correctly. For Intel > IOMMUs, domain_context_mapping() uses it to add a mapping for every > possible alias. But most of the other callers only look at the last > alias and ignore all the others. As it happens, I've just been looking into this and reaching the same conclusion (not least because a fair few of of the incorrect uses have my fingerprints all over them). The of_iommu_configure() and corresponding iort_iommu_configure() uses turn out to already be broken WRT DMA phantom functions vs. MSIs, but pretty straightforward to fix (and I now have one of the offending Marvell SATA cards to test things with). The one in its_pci_msi_prepare() turns out to be totally backwards, as it actually wants to iterate through every device which may alias with the given device (any suggestions for the neatest way to do that are most welcome). The other uses in pci_msi_*() are a little trickier, as in general we're assuming every device is associated with just a single ID (certainly for ARM GICv3 this assumption runs quite strongly through the architecture) - Marc and I have chucked some ideas around, but in the short term, it might be easier to just not even pretend to support MSIs from behind aliasing bridges for the DT/IORT cases. Either way, those are also broken for the phantom function case (and the tentative fix I've written will need rethinking in light of this discussion, oh well). > That might work most of the time, > but: > > - There's no guarantee that pci_for_each_dma_alias() iterates in any > particular order, so relying on the current order is fragile, > > - The pci_add_dma_alias() interface allows an arbitrary number of > aliases (as long as they're all on the same bus), and some devices > do use more than one, e.g., quirk_dma_func0_alias(), > quirk_mic_x200_dma_alias(), > > - pci_for_each_dma_alias() translates the rules in the PCIe to > PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into > aliases. I think it's important to pay attention to *every* > possible alias, not just the last one. Ha, that exact page is still open on my desktop since the "oh crap!" moment last Friday :) If I've interpreted that spec correctly, and it definitely is the case that a bridge may alias or not on a per-transaction basis, then that does end up making matters simpler; I can remove all the attempts to skip any IDs that the IOMMU is guaranteed never to see due to aliasing, if that set is in fact empty. > I suspect the reason this patch makes a difference is because the > current pci_for_each_dma_alias() believes one of those top-level > bridges is an alias, and the iterator produces it last, so that's the > one you map. The IOMMU is attached lower down, so that top-level > bridge is not in fact an alias, but since you only look at the *last* > one, you don't map the correct aliases from lower down in the tree. > > Stopping the iterator earlier happens to make the last alias be one of > the correct ones, but it doesn't solve the problems of quirked devices > that can use multiple requester IDs, and it doesn't solve the problem > of PCIe-to-PCI bridges that optionally take ownership of transactions. Yes, that's pretty much the state of things, other than also solving the legitimate problem of get_pci_alias_group() going too far and inferring a false lack of isolation. Robin. >> I can send out a new patch if needed. >> >> The on chip SATA and USB use MSI-X, so this is needed for basic >> functionality of the platform. > > No need for a new patch; I can integrate something into the changelog. > >>>> Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> >>>> --- >>>> drivers/pci/quirks.c | 14 ++++++++++++++ >>>> 1 file changed, 14 insertions(+) >>>> >>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c >>>> index 6736836..564a84a 100644 >>>> --- a/drivers/pci/quirks.c >>>> +++ b/drivers/pci/quirks.c >>>> @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); >>>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); >>>> >>>> /* >>>> + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are >>>> + * associated not at the root bus, but at a bridge below. This quirk flag >>>> + * will ensure that the aliases are identified correctly. >>>> + */ >>>> +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) >>>> +{ >>>> + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; >>>> +} >>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, >>>> + quirk_bridge_cavm_thrx2_pcie_root); >>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, >>>> + quirk_bridge_cavm_thrx2_pcie_root); >>>> + >>>> +/* >>>> * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) >>>> * class code. Fix it. >>>> */ >> >> Thanks, >> JC. >> >> _______________________________________________ >> linux-arm-kernel mailing list >> linux-arm-kernel@lists.infradead.org >> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On 04/11/2017 11:27 AM, Jayachandran C wrote: > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: >> I suspect the reason this patch makes a difference is because the >> current pci_for_each_dma_alias() believes one of those top-level >> bridges is an alias, and the iterator produces it last, so that's the >> one you map. The IOMMU is attached lower down, so that top-level >> bridge is not in fact an alias, but since you only look at the *last* >> one, you don't map the correct aliases from lower down in the tree. > > Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which > means that a PCI RC can multiple SMMUs, each handling a subset of RIDs. > > In the case of Cavium ThunderX2, the RID which we should see on the RC > - if we follow the standard and factor in the aliasing introduced by the > PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or > ITS). > > But, if we stop the traversal at the point where SMMU (or ITS) is > attached, we will get the correct RID as seen by these. Side note that I am trying to get various specifications clarified to promote more of a familiar alternative architecture (x86) approach in the future in which these aren't at different levels in the topology. But to do that requires integrated Root Complex IP with bells/whistles. Jon.
On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote: > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > > [+cc Joerg] > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > > Hi Jayachandran, > > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > > > [node level PCI bridges - one per node] > > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > > [External PCI devices connected to PCIe links] > > > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > > node level bridges do not introduce an alias either. > > > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > > SoC PCI devices. > > > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > > > Can you supply some text here about why we want to apply this patch? > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > > performance, avoid a crash, etc? > > > > > > If this is for the commit message, I hope the following is ok: > > > > > > "With this change, both MSI-X and IO virtualization work correctly on > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > > devices, and the IOMMU groups are setup correctly." > > > > This doesn't get at what the actual problem is. I'm hoping for > > something like "without this change, we set up an IOMMU mapping for > > requestor ID X, but device DMA uses requestor ID Y because ...., which > > results in an IOMMU fault" > > Ok. I hope this would be better: > > "Without this change, the last alias seen while traversing the PCI > hierarchy will be used as the RID to generate the device ID for ITS > and stream ID for SMMU. This in turn causes the MSI-X generated by the > device to fail since the ITS expects to have translation tables based > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the > device DMA also fails when SMMU is enabled due to incorrect value in > SMMU translation tables" This description is true, but I don't think it addresses the real problem. I think the real problem is that your IOMMU code doesn't handle aliases correctly, and by ignoring these invalid aliases, we happen to map an alias that works for the builtin devices. But that's only because we got lucky (those devices use a single RID and they're not behind bridges that optionally take ownership). It would make sense to me if we fixed the IOMMU code to map *all* the aliases, which should be enough to make your devices work. If we then wanted to apply a patch like this on top, it would be simply an optimization that avoids unnecessary IOMMU mappings. > > I suspect the reason this patch makes a difference is because the > > current pci_for_each_dma_alias() believes one of those top-level > > bridges is an alias, and the iterator produces it last, so that's the > > one you map. The IOMMU is attached lower down, so that top-level > > bridge is not in fact an alias, but since you only look at the *last* > > one, you don't map the correct aliases from lower down in the tree. > > Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which > means that a PCI RC can multiple SMMUs, each handling a subset of RIDs. > > In the case of Cavium ThunderX2, the RID which we should see on the RC > - if we follow the standard and factor in the aliasing introduced by the > PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or > ITS). > > But, if we stop the traversal at the point where SMMU (or ITS) is > attached, we will get the correct RID as seen by these. There is a single "correct RID" for your builtin SATA and USB, but in general there is no single RID. > > Stopping the iterator earlier happens to make the last alias be one of > > the correct ones, but it doesn't solve the problems of quirked devices > > that can use multiple requester IDs, and it doesn't solve the problem > > of PCIe-to-PCI bridges that optionally take ownership of transactions. > > If these happen below the point where the SMMU is attached, we will > consider the last alias introduced, which should be ok. If they are > above, the alias introduced is not relevant. Devices with multiple > aliases is not handled anywhere in ARM code, so I don't think we should > consider that here. I think we *should* consider it here. The multiple alias situation is generic PCIe, independent of ARM. If you want to support arbitrary PCIe plugin devices, you have to handle it. I think any device with a quirk that calls pci_add_dma_alias() will currently fail on your system. And I think devices behind a bridge that optionally takes ownership of DMA transactions will also fail. Bjorn
On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote: > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote: > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > > > [+cc Joerg] > > > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > > > Hi Jayachandran, > > > > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > > > > > [node level PCI bridges - one per node] > > > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > > > [External PCI devices connected to PCIe links] > > > > > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > > > node level bridges do not introduce an alias either. > > > > > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > > > SoC PCI devices. > > > > > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > > > > > Can you supply some text here about why we want to apply this patch? > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > > > performance, avoid a crash, etc? > > > > > > > > If this is for the commit message, I hope the following is ok: > > > > > > > > "With this change, both MSI-X and IO virtualization work correctly on > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > > > devices, and the IOMMU groups are setup correctly." > > > > > > This doesn't get at what the actual problem is. I'm hoping for > > > something like "without this change, we set up an IOMMU mapping for > > > requestor ID X, but device DMA uses requestor ID Y because ...., which > > > results in an IOMMU fault" > > > > Ok. I hope this would be better: > > > > "Without this change, the last alias seen while traversing the PCI > > hierarchy will be used as the RID to generate the device ID for ITS > > and stream ID for SMMU. This in turn causes the MSI-X generated by the > > device to fail since the ITS expects to have translation tables based > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the > > device DMA also fails when SMMU is enabled due to incorrect value in > > SMMU translation tables" > > This description is true, but I don't think it addresses the real > problem. I think the real problem is that your IOMMU code doesn't > handle aliases correctly, and by ignoring these invalid aliases, we > happen to map an alias that works for the builtin devices. But that's > only because we got lucky (those devices use a single RID and they're > not behind bridges that optionally take ownership). > > It would make sense to me if we fixed the IOMMU code to map *all* the > aliases, which should be enough to make your devices work. If we then > wanted to apply a patch like this on top, it would be simply an > optimization that avoids unnecessary IOMMU mappings. The issue that the IOMMU code does not handle valid aliases is unrelated to what I am trying to fix. The quirk is to make sure that invalid aliases are not seen on ThunderX2 while doing pci_for_each_dma_alias(). The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point where the SMMU (or ITS) is attached, i.e. at the bridge marked with PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look for aliases above that point. The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since the on-chip devices are directly connected to the ITS (they do not use SMMU). The top two levels of bridges are not real PCI bridges but just PCI bridge-like things that were added to tie the whole hierarchy together for configuration and enumeration. They do not handle PCI/PCIe transactions in the traditional sense. I think my problem description is still not correct, maybe: "The SMMU (and ITS) expects to device tables to use the RID seen at the bridge they are associated with. Currently the pci_for_each_dma_alias() code traverses beyond this point and generates incorrect aliases due to the PCI and PCI/PCIe bridges above. This causes MSI-X interrupts and device DMA to fail since the SMMU and ITS tables to be setup with incorrect IDs" > > > I suspect the reason this patch makes a difference is because the > > > current pci_for_each_dma_alias() believes one of those top-level > > > bridges is an alias, and the iterator produces it last, so that's the > > > one you map. The IOMMU is attached lower down, so that top-level > > > bridge is not in fact an alias, but since you only look at the *last* > > > one, you don't map the correct aliases from lower down in the tree. > > > > Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which > > means that a PCI RC can multiple SMMUs, each handling a subset of RIDs. > > > > In the case of Cavium ThunderX2, the RID which we should see on the RC > > - if we follow the standard and factor in the aliasing introduced by the > > PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or > > ITS). > > > > But, if we stop the traversal at the point where SMMU (or ITS) is > > attached, we will get the correct RID as seen by these. > > There is a single "correct RID" for your builtin SATA and USB, but in > general there is no single RID. In case of external PCIe devices too, the RID (or aliases) seen to the point where SMMU or ITS is connected (i.e, the bridge marked with flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT) are correct. > > > Stopping the iterator earlier happens to make the last alias be one of > > > the correct ones, but it doesn't solve the problems of quirked devices > > > that can use multiple requester IDs, and it doesn't solve the problem > > > of PCIe-to-PCI bridges that optionally take ownership of transactions. > > > > If these happen below the point where the SMMU is attached, we will > > consider the last alias introduced, which should be ok. If they are > > above, the alias introduced is not relevant. Devices with multiple > > aliases is not handled anywhere in ARM code, so I don't think we should > > consider that here. > > I think we *should* consider it here. The multiple alias situation is > generic PCIe, independent of ARM. If you want to support arbitrary > PCIe plugin devices, you have to handle it. I think any device with a > quirk that calls pci_add_dma_alias() will currently fail on your > system. And I think devices behind a bridge that optionally takes > ownership of DMA transactions will also fail. The issue that IOMMU code does not handle all aliases has to be addressed separately (and Robin seems to be looking at this from his response). And even when that is addressed, this quirk is still needed on ThunderX2, as I pointed out above. Hope I am still not missing the point, and thanks for your patience here. JC.
On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote: > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote: > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote: > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > > > > [+cc Joerg] > > > > > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > > > > Hi Jayachandran, > > > > > > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > > > > > > > [node level PCI bridges - one per node] > > > > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > > > > [External PCI devices connected to PCIe links] > > > > > > > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > > > > node level bridges do not introduce an alias either. > > > > > > > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > > > > SoC PCI devices. > > > > > > > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > > > > > > > Can you supply some text here about why we want to apply this patch? > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > > > > performance, avoid a crash, etc? > > > > > > > > > > If this is for the commit message, I hope the following is ok: > > > > > > > > > > "With this change, both MSI-X and IO virtualization work correctly on > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > > > > devices, and the IOMMU groups are setup correctly." > > > > > > > > This doesn't get at what the actual problem is. I'm hoping for > > > > something like "without this change, we set up an IOMMU mapping for > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which > > > > results in an IOMMU fault" > > > > > > Ok. I hope this would be better: > > > > > > "Without this change, the last alias seen while traversing the PCI > > > hierarchy will be used as the RID to generate the device ID for ITS > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the > > > device to fail since the ITS expects to have translation tables based > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the > > > device DMA also fails when SMMU is enabled due to incorrect value in > > > SMMU translation tables" > > > > This description is true, but I don't think it addresses the real > > problem. I think the real problem is that your IOMMU code doesn't > > handle aliases correctly, and by ignoring these invalid aliases, we > > happen to map an alias that works for the builtin devices. But that's > > only because we got lucky (those devices use a single RID and they're > > not behind bridges that optionally take ownership). > > > > It would make sense to me if we fixed the IOMMU code to map *all* the > > aliases, which should be enough to make your devices work. If we then > > wanted to apply a patch like this on top, it would be simply an > > optimization that avoids unnecessary IOMMU mappings. > > The issue that the IOMMU code does not handle valid aliases is > unrelated to what I am trying to fix. The quirk is to make sure > that invalid aliases are not seen on ThunderX2 while doing > pci_for_each_dma_alias(). > > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point > where the SMMU (or ITS) is attached, i.e. at the bridge marked with > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look > for aliases above that point. > > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since > the on-chip devices are directly connected to the ITS (they do not > use SMMU). > > The top two levels of bridges are not real PCI bridges but just > PCI bridge-like things that were added to tie the whole hierarchy > together for configuration and enumeration. They do not handle > PCI/PCIe transactions in the traditional sense. > > I think my problem description is still not correct, maybe: > "The SMMU (and ITS) expects to device tables to use the RID seen > at the bridge they are associated with. Currently the > pci_for_each_dma_alias() code traverses beyond this point and > generates incorrect aliases due to the PCI and PCI/PCIe bridges > above. This causes MSI-X interrupts and device DMA to fail since > the SMMU and ITS tables to be setup with incorrect IDs" I haven't tried to figure out the MSI-X piece of this, but let me try to come up with a concrete DMA example. Assume this topology: 00:00.0 bridge to [bus 01-1e] 01:0a.0 bridge to [bus 04-05] 04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT) 05:00.0 endpoint Assume 05:00.0 generates a DMA request. Assume the top two bridges are such that pci_for_each_dma_alias() includes them as well, so it iterates through 05:00.0, 01:0a.0, and 00:00.0. When the driver for 05:00.0 makes a DMA mapping, the current code apparently makes an IOMMU mapping for requester ID 00:00.0 because that's the last alias. Obviously this doesn't work because the IOMMU at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0. With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an IOMMU mapping for requester ID 05:00.0, which will work fine. I think it would *also* work fine if we made IOMMU mappings for 05:00.0, 01:0a.0, and 00:0.0. The last two are unnecessary, but probably not harmful. Now assume 05:00 is a multi-function device that has a DMA alias quirk, e.g., see quirk_dma_func0_alias(). It has another function: 05:00.3 endpoint DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0. The driver makes a DMA mapping, pci_for_each_dma_alias() iterates through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only 00:00.0, which again doesn't work. With this quirk, we create a single mapping for 05:00.0. That will work sometimes, but the device may also generate DMA with a requester ID of 05:00.3, and that won't work. If your on-chip device is, e.g., 01:04.0, pci_for_each_dma_alias() probably iterates through 01:04.0, 00:00.0. Today we make an IOMMU mapping for 00:00.0, which doesn't work. With this quirk, we'll ignore 00:00.0 and make a mapping for 01:04.0, which does work. But I think if you made the IOMMU code add mappings for each of the aliases, i.e., for both 01:04.0 and 00:00.0, that device *would* work even without this quirk. Bjorn
On Wed, Apr 12, 2017 at 02:11:38PM -0500, Bjorn Helgaas wrote: > On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote: > > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote: > > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote: > > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > > > > > [+cc Joerg] > > > > > > > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > > > > > Hi Jayachandran, > > > > > > > > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > > > > > > > > > [node level PCI bridges - one per node] > > > > > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > > > > > [External PCI devices connected to PCIe links] > > > > > > > > > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > > > > > node level bridges do not introduce an alias either. > > > > > > > > > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > > > > > SoC PCI devices. > > > > > > > > > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > > > > > > > > > Can you supply some text here about why we want to apply this patch? > > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > > > > > performance, avoid a crash, etc? > > > > > > > > > > > > If this is for the commit message, I hope the following is ok: > > > > > > > > > > > > "With this change, both MSI-X and IO virtualization work correctly on > > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > > > > > devices, and the IOMMU groups are setup correctly." > > > > > > > > > > This doesn't get at what the actual problem is. I'm hoping for > > > > > something like "without this change, we set up an IOMMU mapping for > > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which > > > > > results in an IOMMU fault" > > > > > > > > Ok. I hope this would be better: > > > > > > > > "Without this change, the last alias seen while traversing the PCI > > > > hierarchy will be used as the RID to generate the device ID for ITS > > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the > > > > device to fail since the ITS expects to have translation tables based > > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the > > > > device DMA also fails when SMMU is enabled due to incorrect value in > > > > SMMU translation tables" > > > > > > This description is true, but I don't think it addresses the real > > > problem. I think the real problem is that your IOMMU code doesn't > > > handle aliases correctly, and by ignoring these invalid aliases, we > > > happen to map an alias that works for the builtin devices. But that's > > > only because we got lucky (those devices use a single RID and they're > > > not behind bridges that optionally take ownership). > > > > > > It would make sense to me if we fixed the IOMMU code to map *all* the > > > aliases, which should be enough to make your devices work. If we then > > > wanted to apply a patch like this on top, it would be simply an > > > optimization that avoids unnecessary IOMMU mappings. > > > > The issue that the IOMMU code does not handle valid aliases is > > unrelated to what I am trying to fix. The quirk is to make sure > > that invalid aliases are not seen on ThunderX2 while doing > > pci_for_each_dma_alias(). > > > > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point > > where the SMMU (or ITS) is attached, i.e. at the bridge marked with > > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look > > for aliases above that point. > > > > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since > > the on-chip devices are directly connected to the ITS (they do not > > use SMMU). > > > > The top two levels of bridges are not real PCI bridges but just > > PCI bridge-like things that were added to tie the whole hierarchy > > together for configuration and enumeration. They do not handle > > PCI/PCIe transactions in the traditional sense. > > > > I think my problem description is still not correct, maybe: > > "The SMMU (and ITS) expects to device tables to use the RID seen > > at the bridge they are associated with. Currently the > > pci_for_each_dma_alias() code traverses beyond this point and > > generates incorrect aliases due to the PCI and PCI/PCIe bridges > > above. This causes MSI-X interrupts and device DMA to fail since > > the SMMU and ITS tables to be setup with incorrect IDs" > > I haven't tried to figure out the MSI-X piece of this, but let me try > to come up with a concrete DMA example. Assume this topology: > > 00:00.0 bridge to [bus 01-1e] > 01:0a.0 bridge to [bus 04-05] > 04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT) > 05:00.0 endpoint > > Assume 05:00.0 generates a DMA request. Assume the top two bridges > are such that pci_for_each_dma_alias() includes them as well, so it > iterates through 05:00.0, 01:0a.0, and 00:00.0. > > When the driver for 05:00.0 makes a DMA mapping, the current code > apparently makes an IOMMU mapping for requester ID 00:00.0 because > that's the last alias. Obviously this doesn't work because the IOMMU > at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0. > > With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an > IOMMU mapping for requester ID 05:00.0, which will work fine. I think > it would *also* work fine if we made IOMMU mappings for 05:00.0, > 01:0a.0, and 00:0.0. The last two are unnecessary, but probably not > harmful. Ok. The last two are not harmful but still incorrect. The bridges 0:0.0 and 1:a.0 are outside the path with the PCI transactions takes (they go from 4:0.0 to the SMMU). > Now assume 05:00 is a multi-function device that has a DMA alias > quirk, e.g., see quirk_dma_func0_alias(). It has another function: > > 05:00.3 endpoint > > DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0. > The driver makes a DMA mapping, pci_for_each_dma_alias() iterates > through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only > 00:00.0, which again doesn't work. > > With this quirk, we create a single mapping for 05:00.0. That will > work sometimes, but the device may also generate DMA with a requester > ID of 05:00.3, and that won't work. Note that here, pci_for_each_dma_alias() works correctly with the quirk and incorrectly without the quirk. With the quirk, the callback function is involved on 05:00.3 and 05:00.0 as expected. Without the quirk the call back is invoked on 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, which includes aliases that are not valid. The idea behind the quirk is to get pci_for_each_dma_alias work correctly on ThunderX2, and invoke the function only on the valid aliases. With the quirk, the code using it - like the SMMU code - gets the correct aliases to process, and do not have to deal with or filter out incorrect aliases. I don't think it would be safe to assume that all callers of pci_for_each_dma_alias will be ok to handle invalid aliases. Even in the IOMMU case, the case which calculates stream IDs is one usage, the usage for iommu groups also did not deal with invalid aliases correctly when I tried it. Then there are the MSI cases too, I had done some work on trying to filter out invalid aliases in the relevant code[1], but in my opinion, getting pci_for_each_dma_alias() do the right thing is the correct solution. > If your on-chip device is, e.g., 01:04.0, pci_for_each_dma_alias() > probably iterates through 01:04.0, 00:00.0. Today we make an IOMMU > mapping for 00:00.0, which doesn't work. With this quirk, we'll > ignore 00:00.0 and make a mapping for 01:04.0, which does work. But I > think if you made the IOMMU code add mappings for each of the aliases, > i.e., for both 01:04.0 and 00:00.0, that device *would* work even > without this quirk. There is no IOMMU for the on-chip devices, but the MSI-X interrupts are looked up in a table based on DeviceID(RID) by the ARM interrupt controller. The issue here is similar. JC. [1]https://www.spinics.net/lists/arm-kernel/msg573744.html
On Wed, Apr 12, 2017 at 08:41:20PM +0000, Jayachandran C wrote: > On Wed, Apr 12, 2017 at 02:11:38PM -0500, Bjorn Helgaas wrote: > > On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote: > > > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote: > > > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote: > > > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: > > > > > > [+cc Joerg] > > > > > > > > > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote: > > > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote: > > > > > > > > Hi Jayachandran, > > > > > > > > > > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote: > > > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI > > > > > > > > > topology is slightly unusual. For a multi-node system, it looks like: > > > > > > > > > > > > > > > > > > [node level PCI bridges - one per node] > > > > > > > > > [SoC PCI devices with MSI-X but no IOMMU] > > > > > > > > > [PCI-PCIe "glue" bridges - upto 14, one per real port below] > > > > > > > > > [PCIe real root ports associated with IOMMU and GICv3 ITS] > > > > > > > > > [External PCI devices connected to PCIe links] > > > > > > > > > > > > > > > > > > The top two levels of bridges should have introduced aliases since they > > > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. > > > > > > > > > In the case of external PCIe devices, the "real" root ports are connected > > > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an > > > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the > > > > > > > > > node level bridges do not introduce an alias either. > > > > > > > > > > > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level > > > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, > > > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and > > > > > > > > > SoC PCI devices. > > > > > > > > > > > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID > > > > > > > > > are from Broadcom Vulcan (14e4:90XX). > > > > > > > > > > > > > > > > Can you supply some text here about why we want to apply this patch? > > > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve > > > > > > > > performance, avoid a crash, etc? > > > > > > > > > > > > > > If this is for the commit message, I hope the following is ok: > > > > > > > > > > > > > > "With this change, both MSI-X and IO virtualization work correctly on > > > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to > > > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI > > > > > > > devices, and the IOMMU groups are setup correctly." > > > > > > > > > > > > This doesn't get at what the actual problem is. I'm hoping for > > > > > > something like "without this change, we set up an IOMMU mapping for > > > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which > > > > > > results in an IOMMU fault" > > > > > > > > > > Ok. I hope this would be better: > > > > > > > > > > "Without this change, the last alias seen while traversing the PCI > > > > > hierarchy will be used as the RID to generate the device ID for ITS > > > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the > > > > > device to fail since the ITS expects to have translation tables based > > > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the > > > > > device DMA also fails when SMMU is enabled due to incorrect value in > > > > > SMMU translation tables" > > > > > > > > This description is true, but I don't think it addresses the real > > > > problem. I think the real problem is that your IOMMU code doesn't > > > > handle aliases correctly, and by ignoring these invalid aliases, we > > > > happen to map an alias that works for the builtin devices. But that's > > > > only because we got lucky (those devices use a single RID and they're > > > > not behind bridges that optionally take ownership). > > > > > > > > It would make sense to me if we fixed the IOMMU code to map *all* the > > > > aliases, which should be enough to make your devices work. If we then > > > > wanted to apply a patch like this on top, it would be simply an > > > > optimization that avoids unnecessary IOMMU mappings. > > > > > > The issue that the IOMMU code does not handle valid aliases is > > > unrelated to what I am trying to fix. The quirk is to make sure > > > that invalid aliases are not seen on ThunderX2 while doing > > > pci_for_each_dma_alias(). > > > > > > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point > > > where the SMMU (or ITS) is attached, i.e. at the bridge marked with > > > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look > > > for aliases above that point. > > > > > > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since > > > the on-chip devices are directly connected to the ITS (they do not > > > use SMMU). > > > > > > The top two levels of bridges are not real PCI bridges but just > > > PCI bridge-like things that were added to tie the whole hierarchy > > > together for configuration and enumeration. They do not handle > > > PCI/PCIe transactions in the traditional sense. > > > > > > I think my problem description is still not correct, maybe: > > > "The SMMU (and ITS) expects to device tables to use the RID seen > > > at the bridge they are associated with. Currently the > > > pci_for_each_dma_alias() code traverses beyond this point and > > > generates incorrect aliases due to the PCI and PCI/PCIe bridges > > > above. This causes MSI-X interrupts and device DMA to fail since > > > the SMMU and ITS tables to be setup with incorrect IDs" > > > > I haven't tried to figure out the MSI-X piece of this, but let me try > > to come up with a concrete DMA example. Assume this topology: > > > > 00:00.0 bridge to [bus 01-1e] > > 01:0a.0 bridge to [bus 04-05] > > 04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT) > > 05:00.0 endpoint > > > > Assume 05:00.0 generates a DMA request. Assume the top two bridges > > are such that pci_for_each_dma_alias() includes them as well, so it > > iterates through 05:00.0, 01:0a.0, and 00:00.0. > > > > When the driver for 05:00.0 makes a DMA mapping, the current code > > apparently makes an IOMMU mapping for requester ID 00:00.0 because > > that's the last alias. Obviously this doesn't work because the IOMMU > > at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0. > > > > With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an > > IOMMU mapping for requester ID 05:00.0, which will work fine. I think > > it would *also* work fine if we made IOMMU mappings for 05:00.0, > > 01:0a.0, and 00:0.0. The last two are unnecessary, but probably not > > harmful. > > Ok. The last two are not harmful but still incorrect. The bridges > 0:0.0 and 1:a.0 are outside the path with the PCI transactions takes > (they go from 4:0.0 to the SMMU). > > > Now assume 05:00 is a multi-function device that has a DMA alias > > quirk, e.g., see quirk_dma_func0_alias(). It has another function: > > > > 05:00.3 endpoint > > > > DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0. > > The driver makes a DMA mapping, pci_for_each_dma_alias() iterates > > through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only > > 00:00.0, which again doesn't work. > > > > With this quirk, we create a single mapping for 05:00.0. That will > > work sometimes, but the device may also generate DMA with a requester > > ID of 05:00.3, and that won't work. > > Note that here, pci_for_each_dma_alias() works correctly with the quirk > and incorrectly without the quirk. With the quirk, the callback function > is involved on 05:00.3 and 05:00.0 as expected. With the quirk, pci_for_each_dma_alias() works correctly and iterates through 05:00.3 and 05:00.0. But the IOMMU code only pay attention to the *last* alias, i.e., 05:00.0. So this does not work. > Without the quirk the > call back is invoked on 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, which > includes aliases that are not valid. > > The idea behind the quirk is to get pci_for_each_dma_alias work correctly > on ThunderX2, and invoke the function only on the valid aliases. With the > quirk, the code using it - like the SMMU code - gets the correct aliases > to process, and do not have to deal with or filter out incorrect aliases. > > I don't think it would be safe to assume that all callers of > pci_for_each_dma_alias will be ok to handle invalid aliases. Even in > the IOMMU case, the case which calculates stream IDs is one usage, the > usage for iommu groups also did not deal with invalid aliases correctly > when I tried it. Then there are the MSI cases too, I had done some work > on trying to filter out invalid aliases in the relevant code[1], but > in my opinion, getting pci_for_each_dma_alias() do the right thing is > the correct solution. I agree, we should fix this and it sounds like it's more than just an optimization. But I don't think the quirk is a complete fix, and the changelog needs a short sketch of this discussion to make it clear that this happens to fix some DMA faults, but others remain because the current IOMMU code only maps one of the valid aliases. So I think the only thing we need to move forward on this is a revised changelog. I propose something like this: On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI topology is slightly unusual. For a multi-node system, it looks like: 00:00.0 [?? type] bridge to [bus 01-1e] 01:0a.0 [?? type] bridge to [bus 04-05] 04:00.0 [?? type] bridge to [bus 05] (ThunderX2 XLATE_ROOT) 05:00.0 endpoint pci_for_each_dma_alias() assumes IOMMU translation is done at the root of the PCI hierarchy. It generates 05:00.0 and <BB:DD.F> as DMA aliases for 05:00.0 because bus <BB> is a non-PCIe bus that doesn't carry the Requester ID. Because the ThunderX2 IOMMU is at 04:00.0, <BB:DD.F> is never a valid Requester ID. This quirk stops alias generation at the XLATE_ROOT bridge so we won't generate <BB:DD.F>. The current IOMMU code only maps the last alias (this is a separate bug in itself). Prior to this quirk, we only created IOMMU mappings for the invalid Requester ID <BB:DD.F>, which never matched any DMA transactions. With this quirk, we create IOMMU mappings for a valid Requester ID, which fixes devices with no aliases but leaves devices with aliases still broken. I don't know the details of what type of bridges those are (conventional PCI? PCIe PCI-to-PCIe? etc?) and exactly what RIDs are involved. It'd be nice if you could fill those in. Bjorn
On 04/04/2017 10:28 AM, Robin Murphy wrote:
> So (at the risk of Jon mooing at me)
moooooooooooo
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6736836..564a84a 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias); DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias); /* + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are + * associated not at the root bus, but at a bridge below. This quirk flag + * will ensure that the aliases are identified correctly. + */ +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev) +{ + pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT; +} +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000, + quirk_bridge_cavm_thrx2_pcie_root); +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084, + quirk_bridge_cavm_thrx2_pcie_root); + +/* * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero) * class code. Fix it. */
The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI topology is slightly unusual. For a multi-node system, it looks like: [node level PCI bridges - one per node] [SoC PCI devices with MSI-X but no IOMMU] [PCI-PCIe "glue" bridges - upto 14, one per real port below] [PCIe real root ports associated with IOMMU and GICv3 ITS] [External PCI devices connected to PCIe links] The top two levels of bridges should have introduced aliases since they are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not. In the case of external PCIe devices, the "real" root ports are connected to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an alias. The SoC PCI devices are directly connected to the GIC ITS, so the node level bridges do not introduce an alias either. To handle this quirk, we mark the real PCIe root ports and node level PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. With this, pci_for_each_dma_alias() works correctly for external PCIe devices and SoC PCI devices. For the current revision of Cavium ThunderX2, the VendorID and Device ID are from Broadcom Vulcan (14e4:90XX). Signed-off-by: Jayachandran C <jnair@caviumnetworks.com> --- drivers/pci/quirks.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)