Message ID | 20200831151827.pumm2p54fyj7fz5s@amazon.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | arm64: Enable PCI write-combine resources under sysfs | expand |
On Mon, 2020-08-31 at 15:18 +0000, Clint Sbisa wrote: > Using write-combine is crucial for performance of PCI devices where > significant amounts of transactions go over PCI BARs. > > arm64 supports write-combine PCI mappings, so the appropriate define > has been added which will expose write-combine mappings under sysfs > for prefetchable PCI resources. > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> > --- > arch/arm64/include/asm/pci.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/include/asm/pci.h > b/arch/arm64/include/asm/pci.h > index 70b323cf8300..b33ca260e3c9 100644 > --- a/arch/arm64/include/asm/pci.h > +++ b/arch/arm64/include/asm/pci.h > @@ -17,6 +17,7 @@ > #define pcibios_assign_all_busses() \ > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > +#define arch_can_pci_mmap_wc() 1 > #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1 > > extern int isa_dma_bridge_buggy;
On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > Using write-combine is crucial for performance of PCI devices where > significant amounts of transactions go over PCI BARs. > > arm64 supports write-combine PCI mappings, so the appropriate define > has been added which will expose write-combine mappings under sysfs > for prefetchable PCI resources. > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> Fine with me, I assume Will or Catalin will apply this. > --- > arch/arm64/include/asm/pci.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > index 70b323cf8300..b33ca260e3c9 100644 > --- a/arch/arm64/include/asm/pci.h > +++ b/arch/arm64/include/asm/pci.h > @@ -17,6 +17,7 @@ > #define pcibios_assign_all_busses() \ > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > +#define arch_can_pci_mmap_wc() 1 > #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1 > > extern int isa_dma_bridge_buggy; > -- > 2.23.3 >
On Tue, 2020-09-01 at 13:37 -0500, Bjorn Helgaas wrote: > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > Using write-combine is crucial for performance of PCI devices where > > significant amounts of transactions go over PCI BARs. > > > > arm64 supports write-combine PCI mappings, so the appropriate > > define > > has been added which will expose write-combine mappings under sysfs > > for prefetchable PCI resources. > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > Fine with me, I assume Will or Catalin will apply this. Haha ! Client had sent it to them originally and I told him to resend it to linux-pci, yourself and Lorenzo :-) So the confusion is on me. Will, Catalin, it's all yours. You should have the original patch in your mbox already, otherwise: https://patchwork.kernel.org/patch/11729875/ Cheers, Ben.
On Wed, Sep 02, 2020 at 09:22:53AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-01 at 13:37 -0500, Bjorn Helgaas wrote: > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > > Using write-combine is crucial for performance of PCI devices where > > > significant amounts of transactions go over PCI BARs. > > > > > > arm64 supports write-combine PCI mappings, so the appropriate > > > define > > > has been added which will expose write-combine mappings under sysfs > > > for prefetchable PCI resources. > > > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > > > Fine with me, I assume Will or Catalin will apply this. > > Haha ! Client had sent it to them originally and I told him to resend > it to linux-pci, yourself and Lorenzo :-) > > So the confusion is on me. > > Will, Catalin, it's all yours. You should have the original patch in > your mbox already, otherwise: > > https://patchwork.kernel.org/patch/11729875/ Yup, it's not the radar. We don't usually start queuing stuff until -rc3, so I'm working through the backlog this week. Would like an Ack from Lorenzo, though. Will
On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > Using write-combine is crucial for performance of PCI devices where > significant amounts of transactions go over PCI BARs. Write-combine is an x86ism that means nothing on ARM64 platforms so this should be rewritten to say what you actually mean, namely, you want to allow prefetchable resources to be mapped with "write combine" semantics (which means normal non-cacheable memory on arm64) through proc/sysfs. This is an outright can of worms and the PCI specs don't help in this respect, since we may end up mapping resources that have read side-effects with normal NC mappings (ie that's what "write combine" is in arm64 - pgprot_writecombine() and that's speculative memory). I am referring to "Additional Guidance on the Prefetchable Bit in Memory Space BARs" in the PCI specifications - it does not make any sense and must be removed because people use it to design endpoints. True - this is a problem even in kernel drivers but at least there the ioremap_ semantics is in the driver and can be vetted. This patch would make it user space ABI so I am a little nervous about merging this code TBH. > arm64 supports write-combine PCI mappings, so the appropriate define > has been added which will expose write-combine mappings under sysfs > for prefetchable PCI resources. > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > --- > arch/arm64/include/asm/pci.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > index 70b323cf8300..b33ca260e3c9 100644 > --- a/arch/arm64/include/asm/pci.h > +++ b/arch/arm64/include/asm/pci.h > @@ -17,6 +17,7 @@ > #define pcibios_assign_all_busses() \ > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > +#define arch_can_pci_mmap_wc() 1 I am not comfortable with this blanket enable. Some existing drivers, eg: drivers/infiniband/hw/mlx5 use this macro to detect WC capability which again, it is x86 specific, on arm64 it means nothing and can have consequences on the driver operations. Thanks, Lorenzo > #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1 > > extern int isa_dma_bridge_buggy; > -- > 2.23.3 >
On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote: > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > Using write-combine is crucial for performance of PCI devices where > > significant amounts of transactions go over PCI BARs. > > Write-combine is an x86ism that means nothing on ARM64 platforms > so this should be rewritten to say what you actually mean, namely, > you want to allow prefetchable resources to be mapped with > "write combine" semantics (which means normal non-cacheable > memory on arm64) through proc/sysfs. > > This is an outright can of worms and the PCI specs don't help in this > respect, since we may end up mapping resources that have read > side-effects with normal NC mappings (ie that's what "write combine" is > in arm64 - pgprot_writecombine() and that's speculative memory). > > I am referring to "Additional Guidance on the Prefetchable Bit > in Memory Space BARs" in the PCI specifications - it does not make > any sense and must be removed because people use it to design > endpoints. > > True - this is a problem even in kernel drivers but at least there > the ioremap_ semantics is in the driver and can be vetted. > > This patch would make it user space ABI so I am a little nervous > about merging this code TBH. > User space applications are utilizing WC already. You can see DPDK code using resourceX_wc over the usual resourceX file at https://github.com/DPDK/dpdk/blob/main/drivers/bus/pci/linux/pci_uio.c#L312 (commit https://github.com/DPDK/dpdk/commit/4a928ef9f61). Given that write-combine support was added in 2008 for x86 (and is also enabled for powerpc and ia64), I'm not sure if there'd be a downside to enabling it on arm64 as well given how prevalent it is. Lorenzo, do you still have particular concerns about exposing this to userspace? I understand that my commit message is outright wrong after your explanation, so I'll rewrite that. > > arm64 supports write-combine PCI mappings, so the appropriate define > > has been added which will expose write-combine mappings under sysfs > > for prefetchable PCI resources. > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > --- > > arch/arm64/include/asm/pci.h | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > > index 70b323cf8300..b33ca260e3c9 100644 > > --- a/arch/arm64/include/asm/pci.h > > +++ b/arch/arm64/include/asm/pci.h > > @@ -17,6 +17,7 @@ > > #define pcibios_assign_all_busses() \ > > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > > > +#define arch_can_pci_mmap_wc() 1 > > I am not comfortable with this blanket enable. Some existing drivers, > eg: > > drivers/infiniband/hw/mlx5 > > use this macro to detect WC capability which again, it is x86 specific, > on arm64 it means nothing and can have consequences on the driver > operations. If that driver is fixed to check what it actually wants to check, would that address your concern about the blanket enable? I don't see any other references to this in kernel drivers and I think the documentation at `filesystems/sysfs-pci.rst` outlines it pretty explicitly: Platforms which support write-combining maps of PCI resources must define arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when write-combining is permitted. It is otherwise only used by pci-sysfs to determine if a resourceX_wc file should be exposed. Thanks, Clint > > Thanks, > Lorenzo > > > #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1 > > > > extern int isa_dma_bridge_buggy; > > -- > > 2.23.3 > >
[+LAKML, Will, Catalin] This is an ARM64 arch discussion so please keep the above CCed from now onwards. On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote: > On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote: > > > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > > Using write-combine is crucial for performance of PCI devices where > > > significant amounts of transactions go over PCI BARs. > > > > Write-combine is an x86ism that means nothing on ARM64 platforms > > so this should be rewritten to say what you actually mean, namely, > > you want to allow prefetchable resources to be mapped with > > "write combine" semantics (which means normal non-cacheable > > memory on arm64) through proc/sysfs. > > > > This is an outright can of worms and the PCI specs don't help in this > > respect, since we may end up mapping resources that have read > > side-effects with normal NC mappings (ie that's what "write combine" is > > in arm64 - pgprot_writecombine() and that's speculative memory). > > > > I am referring to "Additional Guidance on the Prefetchable Bit > > in Memory Space BARs" in the PCI specifications - it does not make > > any sense and must be removed because people use it to design > > endpoints. > > > > True - this is a problem even in kernel drivers but at least there > > the ioremap_ semantics is in the driver and can be vetted. > > > > This patch would make it user space ABI so I am a little nervous > > about merging this code TBH. > > > > User space applications are utilizing WC already. You can see DPDK code using > resourceX_wc over the usual resourceX file at > https://github.com/DPDK/dpdk/blob/main/drivers/bus/pci/linux/pci_uio.c#L312 > (commit https://github.com/DPDK/dpdk/commit/4a928ef9f61). I know. > Given that write-combine support was added in 2008 for x86 (and is > also enabled for powerpc and ia64), I'm not sure if there'd be a > downside to enabling it on arm64 as well given how prevalent it is. I explained to you the reasons why this can have downsides and write combine is a concept that does not exist in the ARM64 world, actually it would be good if Ben can chime in to define how this works on power. >Lorenzo, do you still have particular concerns about exposing this to >userspace? Yes I do and I expressed them. The first concern is the WC ambiguity on non-x86 systems, it looks like write combinining means everything and nothing at the same time on != x86 arches. On x86 prefetchable BAR == WC mapping (still conditional on arch features ie PAT, not a blanket enable). On ARM64 WC mapping currently corresponds to normal NC memory and the PCIe specs allow read side-effects BAR to be marked as prefetchable, I need to force PCI sig to remove the section I mentioned from the specifications because there is NO way it can be detected if a prefetchable BAR maps to read side-effects memory. A kernel device driver would (hopefully) know, sysfs code that just checks the prefetchable attribute and exports resource_WC does not. As I mentioned, if the mapping is done in a device specific driver it can be vetted and there are not many drivers mapping BARs as ioremap_wc(). > I understand that my commit message is outright wrong after your explanation, > so I'll rewrite that. > > > > arm64 supports write-combine PCI mappings, so the appropriate define > > > has been added which will expose write-combine mappings under sysfs > > > for prefetchable PCI resources. > > > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > > --- > > > arch/arm64/include/asm/pci.h | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > > > index 70b323cf8300..b33ca260e3c9 100644 > > > --- a/arch/arm64/include/asm/pci.h > > > +++ b/arch/arm64/include/asm/pci.h > > > @@ -17,6 +17,7 @@ > > > #define pcibios_assign_all_busses() \ > > > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > > > > > +#define arch_can_pci_mmap_wc() 1 > > > > I am not comfortable with this blanket enable. Some existing drivers, > > eg: > > > > drivers/infiniband/hw/mlx5 > > > > use this macro to detect WC capability which again, it is x86 specific, > > on arm64 it means nothing and can have consequences on the driver > > operations. > > If that driver is fixed to check what it actually wants to check, would that > address your concern about the blanket enable? I don't see any other references > to this in kernel drivers and I think the documentation at > `filesystems/sysfs-pci.rst` outlines it pretty explicitly: > > Platforms which support write-combining maps of PCI resources must define > arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when > write-combining is permitted. That's exactly the problem. I am asking you: what does "write-combining maps of PCI resources" mean ? I understand we do want weak ordering for prefetchable BAR mappings but my worry is that by exposing the resources as WC to user space we are giving user space the impression that those mappings mirror x86 WC mappings behaviour that is not true on ARM64. Again - Ben has extensive experience on this on power I would be happy to get his point of view before proceeding, it is important to undestand how this work on non-x86 systems. > It is otherwise only used by pci-sysfs to determine if a resourceX_wc > file should be exposed. I know - that's understood. Thanks, Lorenzo
On Wed, Sep 02, 2020 at 05:47:02PM +0100, Lorenzo Pieralisi wrote: > On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote: > > On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote: > > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > > > arm64 supports write-combine PCI mappings, so the appropriate define > > > > has been added which will expose write-combine mappings under sysfs > > > > for prefetchable PCI resources. > > > > > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > > > --- > > > > arch/arm64/include/asm/pci.h | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > > > > index 70b323cf8300..b33ca260e3c9 100644 > > > > --- a/arch/arm64/include/asm/pci.h > > > > +++ b/arch/arm64/include/asm/pci.h > > > > @@ -17,6 +17,7 @@ > > > > #define pcibios_assign_all_busses() \ > > > > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > > > > > > > +#define arch_can_pci_mmap_wc() 1 > > > > > > I am not comfortable with this blanket enable. Some existing drivers, > > > eg: > > > > > > drivers/infiniband/hw/mlx5 > > > > > > use this macro to detect WC capability which again, it is x86 specific, > > > on arm64 it means nothing and can have consequences on the driver > > > operations. > > > > If that driver is fixed to check what it actually wants to check, would that > > address your concern about the blanket enable? I don't see any other references > > to this in kernel drivers and I think the documentation at > > `filesystems/sysfs-pci.rst` outlines it pretty explicitly: > > > > Platforms which support write-combining maps of PCI resources must define > > arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when > > write-combining is permitted. > > That's exactly the problem. I am asking you: what does "write-combining > maps of PCI resources" mean ? > > I understand we do want weak ordering for prefetchable BAR mappings > but my worry is that by exposing the resources as WC to user space > we are giving user space the impression that those mappings mirror > x86 WC mappings behaviour that is not true on ARM64. Would Device_GRE be close to the x86 WC better? It won't allow unaligned accesses and that can be problematic for the user. OTOH, it doesn't speculate reads, so it's safer from the hardware perspective.
On Wed, Sep 02, 2020 at 06:21:57PM +0100, Catalin Marinas wrote: > On Wed, Sep 02, 2020 at 05:47:02PM +0100, Lorenzo Pieralisi wrote: > > On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote: > > > On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote: > > > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote: > > > > > arm64 supports write-combine PCI mappings, so the appropriate define > > > > > has been added which will expose write-combine mappings under sysfs > > > > > for prefetchable PCI resources. > > > > > > > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com> > > > > > --- > > > > > arch/arm64/include/asm/pci.h | 1 + > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h > > > > > index 70b323cf8300..b33ca260e3c9 100644 > > > > > --- a/arch/arm64/include/asm/pci.h > > > > > +++ b/arch/arm64/include/asm/pci.h > > > > > @@ -17,6 +17,7 @@ > > > > > #define pcibios_assign_all_busses() \ > > > > > (pci_has_flag(PCI_REASSIGN_ALL_BUS)) > > > > > > > > > > +#define arch_can_pci_mmap_wc() 1 > > > > > > > > I am not comfortable with this blanket enable. Some existing drivers, > > > > eg: > > > > > > > > drivers/infiniband/hw/mlx5 > > > > > > > > use this macro to detect WC capability which again, it is x86 specific, > > > > on arm64 it means nothing and can have consequences on the driver > > > > operations. > > > > > > If that driver is fixed to check what it actually wants to check, would that > > > address your concern about the blanket enable? I don't see any other references > > > to this in kernel drivers and I think the documentation at > > > `filesystems/sysfs-pci.rst` outlines it pretty explicitly: > > > > > > Platforms which support write-combining maps of PCI resources must define > > > arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when > > > write-combining is permitted. > > > > That's exactly the problem. I am asking you: what does "write-combining > > maps of PCI resources" mean ? > > > > I understand we do want weak ordering for prefetchable BAR mappings > > but my worry is that by exposing the resources as WC to user space > > we are giving user space the impression that those mappings mirror > > x86 WC mappings behaviour that is not true on ARM64. > > Would Device_GRE be close to the x86 WC better? It won't allow unaligned > accesses and that can be problematic for the user. OTOH, it doesn't > speculate reads, so it's safer from the hardware perspective. Thanks Catalin for chiming in, it may yes but I need to figure out the precise semantics of WC on x86 first. Actually *if* I read x86 specs correctly WC mappings allow speculative reads, which then would shift the issue on the PCI specs that allow marking read side effects BARs as prefetchable; in other words if an endpoint is designed with a prefetchable BAR that has read side effects this is already an issue on x86 in the current kernel. There is that, plus the usage of arch_can_pci_mmap_wc() in mellanox drivers which I suspect it is yet another interpretation of x86 write combine - I don't know what happens if we let arch_can_pci_mmap_wc() == 1 on both normalNC or deviceGRE mappings for pgprot_writecombine. I think it is worth getting to the bottom of this before applying this patch. Thanks, Lorenzo
On Wed, 2020-09-02 at 18:54 +0100, Lorenzo Pieralisi wrote: > > > > If that driver is fixed to check what it actually wants to check, would that > > > > address your concern about the blanket enable? I don't see any other references > > > > to this in kernel drivers and I think the documentation at > > > > `filesystems/sysfs-pci.rst` outlines it pretty explicitly: > > > > > > > > Platforms which support write-combining maps of PCI resources must define > > > > arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when > > > > write-combining is permitted. > > > > > > That's exactly the problem. I am asking you: what does "write-combining > > > maps of PCI resources" mean ? > > > > > > I understand we do want weak ordering for prefetchable BAR mappings > > > but my worry is that by exposing the resources as WC to user space > > > we are giving user space the impression that those mappings mirror > > > x86 WC mappings behaviour that is not true on ARM64. > > > > Would Device_GRE be close to the x86 WC better? It won't allow unaligned > > accesses and that can be problematic for the user. OTOH, it doesn't > > speculate reads, so it's safer from the hardware perspective. > > Thanks Catalin for chiming in, it may yes but I need to figure out > the precise semantics of WC on x86 first. We never got to the bottom of that with powerpc... semantics of "WC" are subtly different all over the archs. They key idea I think is for us to state that a WC mapping drops all ordering guarantees :-) That said, the goal here is to expose the sysfs _wc files, without which, mapping of "no-side-effect" memory such as frame buffers etc... produces something very very slow. > Actually *if* I read x86 specs correctly WC mappings allow speculative > reads, which then would shift the issue on the PCI specs that allow > marking read side effects BARs as prefetchable; Yes. > in other words if > an endpoint is designed with a prefetchable BAR that has read side > effects this is already an issue on x86 in the current kernel. An powerpc. We remove the "G" bit. Same deal. > There is that, plus the usage of arch_can_pci_mmap_wc() in mellanox > drivers which I suspect it is yet another interpretation of x86 write > combine - I don't know what happens if we let arch_can_pci_mmap_wc() == 1 > on both normalNC or deviceGRE mappings for pgprot_writecombine. > > I think it is worth getting to the bottom of this before applying > this patch. I think it basically boils down to mapping things without side effect and ordering guarantees but that still cannot be cached. Cheers, Ben.
On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote: > Yes I do and I expressed them. > > The first concern is the WC ambiguity on non-x86 systems, it looks > like write combinining means everything and nothing at the same time > on != x86 arches. > > On x86 prefetchable BAR == WC mapping (still conditional on arch > features ie PAT, not a blanket enable). On ARM64 WC mapping currently > corresponds to normal NC memory and the PCIe specs allow read > side-effects BAR to be marked as prefetchable, I need to force PCI > sig > to remove the section I mentioned from the specifications because > there > is NO way it can be detected if a prefetchable BAR maps to read > side-effects memory. Im not sure I understand your sentence. It's been a long accepted rule in PCI land that "prefetchable" BARs means "no side effects" and in fact allows much more than just prefetching :-) > A kernel device driver would (hopefully) know, sysfs code that just > checks the prefetchable attribute and exports resource_WC does not. > > As I mentioned, if the mapping is done in a device specific driver it > can be vetted and there are not many drivers mapping BARs as > ioremap_wc(). It's been what other architectures have been doing for mroe than a decade without significant issues... I don't think you should worry too much about this. Cheers, Ben.
On Wed, 2020-09-02 at 18:21 +0100, Catalin Marinas wrote: > > I understand we do want weak ordering for prefetchable BAR mappings > > but my worry is that by exposing the resources as WC to user space > > we are giving user space the impression that those mappings mirror > > x86 WC mappings behaviour that is not true on ARM64. > > Would Device_GRE be close to the x86 WC better? It won't allow > unaligned > accesses and that can be problematic for the user. OTOH, it doesn't > speculate reads, so it's safer from the hardware perspective. Its accepted generally that prefetchable BARs can allow speculative accesses, write combining, re-ordering even etc... and it's also commonly be the target of unaligned accesses. In reality, in PCI land, it really means "no side effects". Cheers, Ben.
On Thu, 2020-09-03 at 09:08 +1000, Benjamin Herrenschmidt wrote: > On Wed, 2020-09-02 at 18:21 +0100, Catalin Marinas wrote: > > > I understand we do want weak ordering for prefetchable BAR > > > mappings > > > but my worry is that by exposing the resources as WC to user > > > space > > > we are giving user space the impression that those mappings > > > mirror > > > x86 WC mappings behaviour that is not true on ARM64. > > > > Would Device_GRE be close to the x86 WC better? It won't allow > > unaligned > > accesses and that can be problematic for the user. OTOH, it doesn't > > speculate reads, so it's safer from the hardware perspective. > > Its accepted generally that prefetchable BARs can allow speculative > accesses, write combining, re-ordering even etc... and it's also > commonly be the target of unaligned accesses. > > In reality, in PCI land, it really means "no side effects". Correction: No *read* side effects. Cheers, Ben.
On Thu, Sep 03, 2020 at 09:07:00AM +1000, Benjamin Herrenschmidt wrote: > On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote: > > Yes I do and I expressed them. > > > > The first concern is the WC ambiguity on non-x86 systems, it looks > > like write combinining means everything and nothing at the same time > > on != x86 arches. > > > > On x86 prefetchable BAR == WC mapping (still conditional on arch > > features ie PAT, not a blanket enable). On ARM64 WC mapping currently > > corresponds to normal NC memory and the PCIe specs allow read > > side-effects BAR to be marked as prefetchable, I need to force PCI > > sig > > to remove the section I mentioned from the specifications because > > there > > is NO way it can be detected if a prefetchable BAR maps to read > > side-effects memory. > > Im not sure I understand your sentence. It's been a long accepted rule > in PCI land that "prefetchable" BARs means "no side effects" and in > fact allows much more than just prefetching :-) I am referring to the nefarious: "Additional Guidance on the Prefetchable Bit in Memory Space BARs" I read it 100 times and I still have no idea how it can be implemented, it sorts of acknowledges that read side-effects memory can be marked as a prefetchable BAR *if* the system meets some criteria. As if endpoint designers knew the system where their endpoint is plugged into (+ bit (3) in a BAR is read-only). I think that that implementation note must be removed from the specifications - if anyone dares to follow it this whole WC resource mapping can trigger trouble. Good news is that it would be trouble for all arches :) > > A kernel device driver would (hopefully) know, sysfs code that just > > checks the prefetchable attribute and exports resource_WC does not. > > > > As I mentioned, if the mapping is done in a device specific driver it > > can be vetted and there are not many drivers mapping BARs as > > ioremap_wc(). > > It's been what other architectures have been doing for mroe than a > decade without significant issues... I don't think you should worry too > much about this. Minus what I wrote above, I agree with you. I'd still be able to understand what this patch changes in the mellanox driver HW handling though - not sure what they expect from arch_can_pci_mmap_wc() returning 1. Thanks, Lorenzo
Adding some Mellanox folks to comment on their usage of arch_can_pci_mmap_wc(). On Thu, Sep 03, 2020 at 12:08:44PM +0100, Lorenzo Pieralisi wrote: > On Thu, Sep 03, 2020 at 09:07:00AM +1000, Benjamin Herrenschmidt wrote: > > On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote: > > > Yes I do and I expressed them. > > > > > > The first concern is the WC ambiguity on non-x86 systems, it looks > > > like write combinining means everything and nothing at the same time > > > on != x86 arches. > > > > > > On x86 prefetchable BAR == WC mapping (still conditional on arch > > > features ie PAT, not a blanket enable). On ARM64 WC mapping currently > > > corresponds to normal NC memory and the PCIe specs allow read > > > side-effects BAR to be marked as prefetchable, I need to force PCI > > > sig > > > to remove the section I mentioned from the specifications because > > > there > > > is NO way it can be detected if a prefetchable BAR maps to read > > > side-effects memory. > > > > Im not sure I understand your sentence. It's been a long accepted rule > > in PCI land that "prefetchable" BARs means "no side effects" and in > > fact allows much more than just prefetching :-) > > I am referring to the nefarious: > > "Additional Guidance on the Prefetchable Bit in Memory Space BARs" > > I read it 100 times and I still have no idea how it can be implemented, > it sorts of acknowledges that read side-effects memory can be marked > as a prefetchable BAR *if* the system meets some criteria. > > As if endpoint designers knew the system where their endpoint is > plugged into (+ bit (3) in a BAR is read-only). > > I think that that implementation note must be removed from the > specifications - if anyone dares to follow it this whole > WC resource mapping can trigger trouble. > > Good news is that it would be trouble for all arches :) > > > > A kernel device driver would (hopefully) know, sysfs code that just > > > checks the prefetchable attribute and exports resource_WC does not. > > > > > > As I mentioned, if the mapping is done in a device specific driver it > > > can be vetted and there are not many drivers mapping BARs as > > > ioremap_wc(). > > > > It's been what other architectures have been doing for mroe than a > > decade without significant issues... I don't think you should worry too > > much about this. > > Minus what I wrote above, I agree with you. I'd still be able to > understand what this patch changes in the mellanox driver HW > handling though - not sure what they expect from arch_can_pci_mmap_wc() > returning 1. This seems to have been added broadly for x86, PPC, and ARM as part of initial WC support in the driver (37aa5c36 "IB/mlx5: Add UARs write-combining and non-cached mapping"). It was updated to use `arch_can_pci_mmap_wc()` later (1f3db161 "IB/mlx5: Generally use the WC auto detection test result"). Guy, Yishai, there are some concerns about difference the technical definition of WC and how WC is actually used. Can you comment on the usage of WC in mlx5 and which definition of WC the driver utilizes? We're unsure if a blanket enable for arm64 is safe in light of the driver's use of this function. Thanks, Clint > > Thanks, > Lorenzo
On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > "Additional Guidance on the Prefetchable Bit in Memory Space BARs" > > I read it 100 times and I still have no idea how it can be > implemented, > it sorts of acknowledges that read side-effects memory can be marked > as a prefetchable BAR *if* the system meets some criteria. > > As if endpoint designers knew the system where their endpoint is > plugged into (+ bit (3) in a BAR is read-only). > > I think that that implementation note must be removed from the > specifications - if anyone dares to follow it this whole > WC resource mapping can trigger trouble. Ah that one ! Yes you are right its completely broken. This part of the spec aims at working around the fact that bridges only have 64-bit prefetchable windows, so anything non-pref has to go below a 32-bit bridge window (effectively making most 64-bit non-pref BARs a pointless waste of silicon). The right fix of course would have been to create a new type of bridge window. But PCI... If you're going to mess around with the SIG, I would suggest that a better approach short of the above would be to allow system software to put 64-bit non-pref BARs below bridge pref windows on PCIe (provided the various otehr restrictions in that note are honored such as bridges not prefetching) and leave it at that. (Unless they already do somewhere else, I forgot ...). This should be sufficient to address the space concern without killing the meaning of the prefetchable bit. As for enabling the _wc files in sysfs, well, you need some serious priviledge to be able to access them, so I don't see a big issue allowing them to exist when "prefetchable" is set regardless of that rule. The Mellanox case might be different. Cheers, Ben.
On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > It's been what other architectures have been doing for mroe than a > > decade without significant issues... I don't think you should worry > > too > > much about this. > > Minus what I wrote above, I agree with you. I'd still be able to > understand what this patch changes in the mellanox driver HW > handling though - not sure what they expect from > arch_can_pci_mmap_wc() > returning 1. I don't know enough to get into the finer details but looking a bit it seems when this is set, they allow extra ioctls to create buffers mapped with pgprot_writecombine(). I suppose this means faster MMIO backet buffers for small packets (ie, non-DMA use case). Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a non-ROCE ethernet port on a PF... For anyting else, it just seems to actually try to do it and see what happens :-) Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and whether you see an issue with enabling this on arm64 ? Cheers, Ben.
[+Jason] On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > > It's been what other architectures have been doing for mroe than a > > > decade without significant issues... I don't think you should worry > > > too > > > much about this. > > > > Minus what I wrote above, I agree with you. I'd still be able to > > understand what this patch changes in the mellanox driver HW > > handling though - not sure what they expect from > > arch_can_pci_mmap_wc() > > returning 1. > > I don't know enough to get into the finer details but looking a bit it > seems when this is set, they allow extra ioctls to create buffers > mapped with pgprot_writecombine(). > > I suppose this means faster MMIO backet buffers for small packets (ie, > non-DMA use case). > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a > non-ROCE ethernet port on a PF... For anyting else, it just seems to > actually try to do it and see what happens :-) > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and > whether you see an issue with enabling this on arm64 ? Hi Jason, I was wondering if you could help us with this question, we are trying to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause in mellanox drivers wrt mappings and whether there is an expected behaviour behind them, in particular whether there is an implicit reliance on x86 write-combine arch/interconnect details. Thanks, Lorenzo
On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote: > [+Jason] > > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > > > It's been what other architectures have been doing for mroe than a > > > > decade without significant issues... I don't think you should worry > > > > too > > > > much about this. > > > > > > Minus what I wrote above, I agree with you. I'd still be able to > > > understand what this patch changes in the mellanox driver HW > > > handling though - not sure what they expect from > > > arch_can_pci_mmap_wc() > > > returning 1. > > > > I don't know enough to get into the finer details but looking a bit it > > seems when this is set, they allow extra ioctls to create buffers > > mapped with pgprot_writecombine(). > > > > I suppose this means faster MMIO backet buffers for small packets (ie, > > non-DMA use case). > > > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a > > non-ROCE ethernet port on a PF... For anyting else, it just seems to > > actually try to do it and see what happens :-) > > > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and > > whether you see an issue with enabling this on arm64 ? > > Hi Jason, > > I was wondering if you could help us with this question, we are trying > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause > in mellanox drivers wrt mappings and whether there is an expected > behaviour behind them, in particular whether there is an implicit > reliance on x86 write-combine arch/interconnect details. Sorry, somehow I missed this email. The arch_can_pci_mmap_wc() used in IB representors, special mode where we can't perform write-combine test below. The commit 1f3db161881b ("IB/mlx5: Generally use the WC auto detection test result") describes it. I don't see any problem with enabling arch_can_pci_mmap_wc() on ARM. Thanks > > Thanks, > Lorenzo
On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote: > [+Jason] > > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > > > It's been what other architectures have been doing for mroe than a > > > > decade without significant issues... I don't think you should worry > > > > too > > > > much about this. > > > > > > Minus what I wrote above, I agree with you. I'd still be able to > > > understand what this patch changes in the mellanox driver HW > > > handling though - not sure what they expect from > > > arch_can_pci_mmap_wc() > > > returning 1. > > > > I don't know enough to get into the finer details but looking a bit it > > seems when this is set, they allow extra ioctls to create buffers > > mapped with pgprot_writecombine(). > > > > I suppose this means faster MMIO backet buffers for small packets (ie, > > non-DMA use case). > > > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a > > non-ROCE ethernet port on a PF... For anyting else, it just seems to > > actually try to do it and see what happens :-) > > > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and > > whether you see an issue with enabling this on arm64 ? > > Hi Jason, > > I was wondering if you could help us with this question, we are trying > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause > in mellanox drivers wrt mappings and whether there is an expected > behaviour behind them, in particular whether there is an implicit > reliance on x86 write-combine arch/interconnect details. Looking back at this big thread, let me add some perspective Mellanox drivers have a performance optimization where a 64 byte MemWr TLP from the root complex to the MMIO BAR will perform better, often quite a bit better. We run WC in full QA'd production on PPC, ARM and x86. The userspace generates a burst of sequential, aligned 8 byte CPU writes to the MMIO address and triggers an arch specific CPU barrier to flush/fence the CPU WC buffer. At this point the CPU should emit the 64 byte TLP toward the device ASAP. In other words, the only usage here is only about Write. The CPU should never, ever, generate a MemRD TLP. The code never does a read explicitly. If the CPU fails to generate a 64 byte TLP then the device will still operate correctly but does a different, slower, flow. If the CPU consistently fails WC then the overhead of trying the WC flow is a notable net performance loss, and on these CPUs we want to use only 8 byte write to the MMIO BAR, with NC memory. There are many important details about how this works and how this must interact with the CPU barriers and locking. On x86, arch_can_pci_mmap_wc() is basically meaningless. It indicates there is a chance that pgprot_writecombine() could work. It can also be 0 and write combining will work just fine :\. Thus, mlx5 switched to doing a runtime WC test to determine if the CPU actually supports WC or not. If the arch can reliably tell the driver then this test could be avoided. Based on this test the WC mode is allowed for userspace. The one call to arch_can_pci_mmap_wc() is in a case where the HW is configured in a way that can't run the test, here we use arch_can_pci_mmap_wc() to guess if the CPU has working WC or not. Ideally an arch would return 1 only when the CPU has working WC. Depending on workload WC may not be a win. In those cases userspace will select NC. Thus the same PCI MMIO BAR region can have a mixture of pages with WC and NC mappings to userspace. For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This seems to be the preferred working configuration on at least some ARM SOCs. So far nobody from the ARM world has shown interest in making a mainline solution. :( I can't recall if this is because the relevant ARM SOC's don't support pgprot_writecombine(), or it doesn't work properly. I was told the reason ARM never enabled WC was because unaligned access to WC memory was not supported, and there were existing drivers that did unaligned writes that would malfunction. I thought this meant that pgprot_writecombine() was non-working in ARM Linux? So, bit surprised to see a patch messing with arch_can_pci_mmap_wc() and not changing the defintion of pgprot_writecombine() ? mlx5 is more or less a representative user WC for this kind of 'message push' methodology. Several other RDMA devices do this as well. The methodology is important enough that recent Intel CPUs have a dedicated instruction to push a 128 byte message in a single TLP avoiding this whole WC mess. Frankly, I think the kernel should introduce a well defined pgprot for this working mode that all archs can agree upon. It should include the alignment requirement, message push function, CPU barrier macros, and locking macros that are needed to use this facility correctly. Defined in a way that is compatible with DEVICE_GRE and can be used by these 'message push' drivers. That would switch alway most of the users in the kernel today. Jason
On Thu, Sep 10, 2020 at 09:37:58AM -0300, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote: > > [+Jason] > > > > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote: > > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > > > > It's been what other architectures have been doing for mroe than a > > > > > decade without significant issues... I don't think you should worry > > > > > too > > > > > much about this. > > > > > > > > Minus what I wrote above, I agree with you. I'd still be able to > > > > understand what this patch changes in the mellanox driver HW > > > > handling though - not sure what they expect from > > > > arch_can_pci_mmap_wc() > > > > returning 1. > > > > > > I don't know enough to get into the finer details but looking a bit it > > > seems when this is set, they allow extra ioctls to create buffers > > > mapped with pgprot_writecombine(). > > > > > > I suppose this means faster MMIO backet buffers for small packets (ie, > > > non-DMA use case). > > > > > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a > > > non-ROCE ethernet port on a PF... For anyting else, it just seems to > > > actually try to do it and see what happens :-) > > > > > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and > > > whether you see an issue with enabling this on arm64 ? > > > > Hi Jason, > > > > I was wondering if you could help us with this question, we are trying > > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause > > in mellanox drivers wrt mappings and whether there is an expected > > behaviour behind them, in particular whether there is an implicit > > reliance on x86 write-combine arch/interconnect details. > > Looking back at this big thread, let me add some perspective Thank you - it was needed. > Mellanox drivers have a performance optimization where a 64 byte MemWr > TLP from the root complex to the MMIO BAR will perform better, often > quite a bit better. We run WC in full QA'd production on PPC, ARM and > x86. > > The userspace generates a burst of sequential, aligned 8 byte CPU > writes to the MMIO address and triggers an arch specific CPU barrier > to flush/fence the CPU WC buffer. At this point the CPU should emit > the 64 byte TLP toward the device ASAP. While at it - mind explaining please what those 64 bytes actully contain ? > In other words, the only usage here is only about Write. The CPU > should never, ever, generate a MemRD TLP. The code never does a read > explicitly. On arm64 pgprot_writecombine() is speculative memory (normal non-cacheable), which may not do what you expect from it. > If the CPU fails to generate a 64 byte TLP then the device will still > operate correctly but does a different, slower, flow. Side note: on ARM that TLP is not a native interconnect transaction, reworded, it depends on what the system-bus->PCI logic does in this respect. > If the CPU consistently fails WC then the overhead of trying the WC > flow is a notable net performance loss, and on these CPUs we want to > use only 8 byte write to the MMIO BAR, with NC memory. That's why I looped you in - that's what worries me about "enabling" arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf regressions that's not OK. Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox driver (or more broadly all drivers following this message push semantics) to use "something else" for WC detection. > There are many important details about how this works and how this > must interact with the CPU barriers and locking. > > On x86, arch_can_pci_mmap_wc() is basically meaningless. On arm64 too, for the records - or better, write-combine is not well defined, ergo I don't know what arch_can_pci_mmap_wc() means. > It indicates there is a chance that pgprot_writecombine() could work. > It can also be 0 and write combining will work just fine :\. > > Thus, mlx5 switched to doing a runtime WC test to determine if the CPU > actually supports WC or not. If the arch can reliably tell the driver > then this test could be avoided. Based on this test the WC mode is > allowed for userspace. Can you elaborate on this runtime test please ? > The one call to arch_can_pci_mmap_wc() is in a case where the HW is > configured in a way that can't run the test, here we use > arch_can_pci_mmap_wc() to guess if the CPU has working WC or not. > Ideally an arch would return 1 only when the CPU has working WC. Which means we can guarantee the TLP packet you mentioned above I guess ? We have to define "working WC" :) > Depending on workload WC may not be a win. In those cases userspace > will select NC. Thus the same PCI MMIO BAR region can have a mixture > of pages with WC and NC mappings to userspace. > > For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices > are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This > seems to be the preferred working configuration on at least some ARM > SOCs. So far nobody from the ARM world has shown interest in making a > mainline solution. :( > > I can't recall if this is because the relevant ARM SOC's don't support > pgprot_writecombine(), or it doesn't work properly. > > I was told the reason ARM never enabled WC was because unaligned When you say "enabled WC" I assume you mean making: pgprot_writecombine() == DEVICE_GRE > access to WC memory was not supported, and there were existing drivers > that did unaligned writes that would malfunction. I thought this meant > that pgprot_writecombine() was non-working in ARM Linux? On arm64 pgprot_writecombine() is normal non-cacheable memory at the moment - it works but that does not precisely do what you *expect* from arch_can_pci_mmap_wc(), that's the whole point I am making. > So, bit surprised to see a patch messing with arch_can_pci_mmap_wc() > and not changing the defintion of pgprot_writecombine() ? We can't change pgprot_writecombine() to DEVICE_GRE, it can trigger issues on some drivers, see unaligned memory access. > mlx5 is more or less a representative user WC for this kind of > 'message push' methodology. Several other RDMA devices do this as > well. The methodology is important enough that recent Intel CPUs have > a dedicated instruction to push a 128 byte message in a single TLP > avoiding this whole WC mess. > > Frankly, I think the kernel should introduce a well defined pgprot for > this working mode that all archs can agree upon. It should include the > alignment requirement, message push function, CPU barrier macros, and > locking macros that are needed to use this facility correctly. > > Defined in a way that is compatible with DEVICE_GRE and can be used by > these 'message push' drivers. That would switch alway most of the > users in the kernel today. That's probably the way forward - I still have concerns about this patch as it stands given your clarifications above. Lorenzo
On Thu, Sep 10, 2020 at 04:17:21PM +0100, Lorenzo Pieralisi wrote: > > In other words, the only usage here is only about Write. The CPU > > should never, ever, generate a MemRD TLP. The code never does a read > > explicitly. > > On arm64 pgprot_writecombine() is speculative memory (normal > non-cacheable), which may not do what you expect from it. Can you explain what this actually does on ARM? Can it ever speculate loads across page boundaries, or speculate loads that never exist in the program? ie will we get random unpredicable MemRds? Does it/could it "combine writes"? > > If the CPU fails to generate a 64 byte TLP then the device will still > > operate correctly but does a different, slower, flow. > > Side note: on ARM that TLP is not a native interconnect transaction, > reworded, it depends on what the system-bus->PCI logic does in > this respect. I think the issue is that ARM never defined what the bits set by pgprot_writecombine() do at a system level so we see implementations that do not cause write combining at the PCI-E interface for those bits. (I assume from what I've heard) > That's why I looped you in - that's what worries me about "enabling" > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > regressions that's not OK. > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > driver (or more broadly all drivers following this message push > semantics) to use "something else" for WC detection. arch_can_pci_mmap_wc() really only controls the sysfs resource file and it seems very unclear who in userspace uses that these days. vfio is now the right way to do that stuff. I don't see an obvious way to get WC memory in VFIO though... So, I don't think this patch will break anything, however I also don't see a point to it as nobody should be using the sys/resource files these days.. Curios why Clint proposed it? > > Thus, mlx5 switched to doing a runtime WC test to determine if the CPU > > actually supports WC or not. If the arch can reliably tell the driver > > then this test could be avoided. Based on this test the WC mode is > > allowed for userspace. > > Can you elaborate on this runtime test please ? mlx5 is a network device, so the purpose of the 64 byte TLP is to push, say, a network send command to the device fully formed (eg with the DMA list, etc). Otherwise we can only push an indication to read the 64 byte command via DMA from the command ring. Avoiding the extra DMA is a performance win. The runtime test pushes a 64 byte command to the device. If it arrives as a single TLP then the device executes that command, otherwise it triggers DMA and reads the 64 bytes from host memory. The test arranges that the command pushed via the TLP and the command fetched via DMA are different. The test can then determine which path the device took and thus know directly if the implementation can deliver the required TLP or not. > > configured in a way that can't run the test, here we use > > arch_can_pci_mmap_wc() to guess if the CPU has working WC or not. > > Ideally an arch would return 1 only when the CPU has working WC. > > Which means we can guarantee the TLP packet you mentioned above I > guess ? Yes. For a PCI driver I think this is the only thing that matters for "write combining". mlx5 is pretty strict, other drivers might gain advantage from smaller or more fragmented transfers, eg 32 byte chunks or something. > > Depending on workload WC may not be a win. In those cases userspace > > will select NC. Thus the same PCI MMIO BAR region can have a mixture > > of pages with WC and NC mappings to userspace. > > > > For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices > > are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This > > seems to be the preferred working configuration on at least some ARM > > SOCs. So far nobody from the ARM world has shown interest in making a > > mainline solution. :( > > > > I can't recall if this is because the relevant ARM SOC's don't support > > pgprot_writecombine(), or it doesn't work properly. > > > > I was told the reason ARM never enabled WC was because unaligned > > When you say "enabled WC" I assume you mean making: > > pgprot_writecombine() == DEVICE_GRE Well, when I say "enabled WC" I mean it seems existing pgprot_writecombine() does not give write combining at PCI-E and the option that does give write combining is DEVICE_GRE, that has the alignment problem and can't be used. At least on some SOCs. > > access to WC memory was not supported, and there were existing drivers > > that did unaligned writes that would malfunction. I thought this meant > > that pgprot_writecombine() was non-working in ARM Linux? > > On arm64 pgprot_writecombine() is normal non-cacheable memory at the > moment - it works but that does not precisely do what you *expect* from > arch_can_pci_mmap_wc(), that's the whole point I am making. Most places use pgprot_writecombine() blindly and just ignore arch_can_pci_mmap_wc() - I suppose with the idea that pgprot_writecombine() will be a harmless NOP if it isn't supported? mlx5 is possibly the only place where someone actually tested and cared about the performance difference between using a WC specific algorithm or not. > > mlx5 is more or less a representative user WC for this kind of > > 'message push' methodology. Several other RDMA devices do this as > > well. The methodology is important enough that recent Intel CPUs have > > a dedicated instruction to push a 128 byte message in a single TLP > > avoiding this whole WC mess. > > > > Frankly, I think the kernel should introduce a well defined pgprot for > > this working mode that all archs can agree upon. It should include the > > alignment requirement, message push function, CPU barrier macros, and > > locking macros that are needed to use this facility correctly. > > > > Defined in a way that is compatible with DEVICE_GRE and can be used by > > these 'message push' drivers. That would switch alway most of the > > users in the kernel today. > > That's probably the way forward - I still have concerns about this > patch as it stands given your clarifications above. I would very much like to see some support for DEVICE_GRE (at least on the SOCs that require it) in ARM going forward. For whatever reason this seems to be necessary to get write PCI combining behavior on enough ARM hardware that it should be supported in the upstream kernel. Jason
On Thu, 2020-09-10 at 14:10 -0300, Jason Gunthorpe wrote: > Can you explain what this actually does on ARM? > > Can it ever speculate loads across page boundaries, or speculate > loads > that never exist in the program? ie will we get random unpredicable > MemRds? Probably, at least on powerpc you will as well, that's the only way to get write combine. > Does it/could it "combine writes"? I assume so for ARM, definitely for powerpc. > > > If the CPU fails to generate a 64 byte TLP then the device will > > > still > > > operate correctly but does a different, slower, flow. > > > > Side note: on ARM that TLP is not a native interconnect > > transaction, > > reworded, it depends on what the system-bus->PCI logic does in > > this respect. > > I think the issue is that ARM never defined what the bits set by > pgprot_writecombine() do at a system level so we see implementations > that do not cause write combining at the PCI-E interface for those > bits. (I assume from what I've heard) Nobody did. I think only x86 has a real "write combine" attribute. I tried to untangled that mess years ago and didnt' get to the bottom of it, but basically, on non-x86 archs, pgprot_writecombine will give you what you asked ... and more. > > That's why I looped you in - that's what worries me about > > "enabling" > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > > regressions that's not OK. > > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > > driver (or more broadly all drivers following this message push > > semantics) to use "something else" for WC detection. > > arch_can_pci_mmap_wc() really only controls the sysfs resource file > and it seems very unclear who in userspace uses that these days. dpdk under some circumstances afaik. > vfio is now the right way to do that stuff. I don't see an obvious > way to get WC memory in VFIO though... Which would be a performance issue on a number of things I suppose... Cheers, Ben.
On Fri, Sep 11, 2020 at 07:46:47AM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2020-09-10 at 14:10 -0300, Jason Gunthorpe wrote: > > Can you explain what this actually does on ARM? > > > > Can it ever speculate loads across page boundaries, or speculate > > loads > > that never exist in the program? ie will we get random unpredicable > > MemRds? > > Probably, at least on powerpc you will as well, that's the only way to > get write combine. If I remove the PROT_READ in the user space mmap will it block it? Read TLPs are not harmful but I suspect they would cause an undesirable random performance anomaly. > > Does it/could it "combine writes"? > > I assume so for ARM, definitely for powerpc. Various IBM PPC chips I know work, we do test that. > > > That's why I looped you in - that's what worries me about > > > "enabling" > > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > > > regressions that's not OK. > > > > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > > > driver (or more broadly all drivers following this message push > > > semantics) to use "something else" for WC detection. > > > > arch_can_pci_mmap_wc() really only controls the sysfs resource file > > and it seems very unclear who in userspace uses that these days. > > dpdk under some circumstances afaik. And something gross for DMA then? Not sure dpdk is useful without DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing? > > vfio is now the right way to do that stuff. I don't see an obvious > > way to get WC memory in VFIO though... > > Which would be a performance issue on a number of things I suppose... Almost nothing uses pci_iomap_wc(), so I'd be surpried if userspace DPDK was an important user when an in-kernel driver for the same HW doesn't use it? Jason
On Thu, 2020-09-10 at 20:29 -0300, Jason Gunthorpe wrote: > > Probably, at least on powerpc you will as well, that's the only way to > > get write combine. > > If I remove the PROT_READ in the user space mmap will it block it? No. powerpc at least doesn't have write-only mappings. > Read TLPs are not harmful but I suspect they would cause an > undesirable random performance anomaly. I suspect in practice you wont get them esp. if the code has barriers but ... it's allowed by the architecture. > > > Does it/could it "combine writes"? > > > > I assume so for ARM, definitely for powerpc. > > Various IBM PPC chips I know work, we do test that. > > > > > That's why I looped you in - that's what worries me about > > > > "enabling" > > > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > > > > regressions that's not OK. > > > > > > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > > > > driver (or more broadly all drivers following this message push > > > > semantics) to use "something else" for WC detection. > > > > > > arch_can_pci_mmap_wc() really only controls the sysfs resource file > > > and it seems very unclear who in userspace uses that these days. > > > > dpdk under some circumstances afaik. > > And something gross for DMA then? Not sure dpdk is useful without > DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing? Clint, can you elaborate on the use case ? > > > vfio is now the right way to do that stuff. I don't see an obvious > > > way to get WC memory in VFIO though... > > > > Which would be a performance issue on a number of things I suppose... > > Almost nothing uses pci_iomap_wc(), so I'd be surpried if userspace > DPDK was an important user when an in-kernel driver for the same HW > doesn't use it? Hard to know how uses those files out there but I don't like arm not providing what pretty much all relevant archs do provide since the semantics afaik aren't that different. Yes, "write combine" isn't a good name.... The goal is to get WC but it comes with the whole package on several archs. We don't even have a reasonnable definition of the semantics of readl/writel on a WC mapping (hint: on powerpc the barriers in them will prevent WC even on a WC mapping) nor of what barriers might work and how on such a mapping. I tried a while ago and ... ugh. Cheers, Ben.
On Fri, Sep 11, 2020 at 10:39:16AM +1000, Benjamin Herrenschmidt wrote: > Yes, "write combine" isn't a good name.... The goal is to get WC but it > comes with the whole package on several archs. We don't even have a > reasonnable definition of the semantics of readl/writel on a WC mapping > (hint: on powerpc the barriers in them will prevent WC even on a WC > mapping) nor of what barriers might work and how on such a mapping. Yes, you can't really use WC properly in the kernel, we don't have the infrastructure for it. mlx5 is using __raw_writeq() and wmb() to hack something ugly together in the kernel. A useful API for the message method, similar to what we use in userspace, is something like: /* * Almost always need a spinlock as multiple CPUs cannot write * concurrently. */ spin_lock(); /* * Ensure that all DMA visiable writes in program order are visible * to DMA before the WC TLP is sent. */ barrier_wc_after_lock(); /* Generate the TLP */ write_wc_message(wc_iomem, message, len); /* * Writes to wc_iomem past this, by any CPU, cannot replace writes * already done in wc_message. */ barrier_wc_before_unlock(); spin_unlock(); And another varient without the spinlock for stuff that can be per-CPU for a range of WC memory. (oh actually I see most drivers are using ioremap_wc(), and there is a bunch of them including an Amazon ethernet device...) Jason
On Fri, Sep 11, 2020 at 10:39:16AM +1000, Benjamin Herrenschmidt wrote: > > > > > That's why I looped you in - that's what worries me about > > > > > "enabling" > > > > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > > > > > regressions that's not OK. > > > > > > > > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > > > > > driver (or more broadly all drivers following this message push > > > > > semantics) to use "something else" for WC detection. > > > > > > > > arch_can_pci_mmap_wc() really only controls the sysfs resource file > > > > and it seems very unclear who in userspace uses that these days. > > > > > > dpdk under some circumstances afaik. > > > > And something gross for DMA then? Not sure dpdk is useful without > > DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing? > > Clint, can you elaborate on the use case ? > The use-case I'm targeting is the ENA pmd in DPDK. For performance reasons (many of which are very similar to what Jason has described for mlx5), we need to generate full-sized TLPs instead of many partial TLPs to improve efficiency. Here's an excerpt describing the write-combine usage from ./Documentation/networking/device_drivers/ethernet/amazon/ena.rst: - Low Latency Queue (LLQ) mode or "push-mode". * In this mode the driver pushes the transmit descriptors and the first 128 bytes of the packet directly to the ENA device memory space. The rest of the packet payload is fetched by the device. For this operation mode, the driver uses a dedicated PCI device memory BAR, which is mapped with write-combine capability. There's no DMA involved with this BAR-- the driver writes a portion of the packet contents in addition to the descriptors, which generally increases the number of TLPs if write-combine isn't used. Furthermore, this BAR is only used for writes and never for reads. As Jason noted in the other reply to this email, the Linux ENA driver makes use of WC by using devm_ioremap_wc(). The DPDK code here uses the same mechanism in user-space to enable write-combining by mapping the resourceX_wc file if the driver uses RTE_PCI_DRV_WC_ACTIVATE. Clint
On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote: > There's no DMA involved with this BAR-- the driver writes a portion of the > packet contents in addition to the descriptors, which generally increases the > number of TLPs if write-combine isn't used. Furthermore, this BAR is only used > for writes and never for reads. You use DPDK without DMA? How does receive work? > As Jason noted in the other reply to this email, the Linux ENA driver makes use > of WC by using devm_ioremap_wc(). As Ben noted we don't have kernel accessors to make this portable or safe :( Jason
On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote: > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote: > > > There's no DMA involved with this BAR-- the driver writes a portion of the > > packet contents in addition to the descriptors, which generally increases the > > number of TLPs if write-combine isn't used. Furthermore, this BAR is only used > > for writes and never for reads. > > You use DPDK without DMA? How does receive work? That was not worded well. DMA is used, but the first X bytes of the packet are written directly to this BAR instead of being DMA'd-- the rest of the data is DMA'd. > > > As Jason noted in the other reply to this email, the Linux ENA driver makes use > > of WC by using devm_ioremap_wc(). > > As Ben noted we don't have kernel accessors to make this portable or > safe :( And therein lies our dilemma... Clint
On Mon, Sep 14, 2020 at 02:24:06PM +0000, Clint Sbisa wrote: > On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote: > > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote: > > > > > There's no DMA involved with this BAR-- the driver writes a portion of the > > > packet contents in addition to the descriptors, which generally increases the > > > number of TLPs if write-combine isn't used. Furthermore, this BAR is only used > > > for writes and never for reads. > > > > You use DPDK without DMA? How does receive work? > > That was not worded well. DMA is used, but the first X bytes of the packet are > written directly to this BAR instead of being DMA'd-- the rest of the data is > DMA'd. which is back to my original question, how do you do DMA using /sys/xx/resources? Why not use VFIO like everything else? Jason
On Mon, 2020-09-14 at 11:17 -0300, Jason Gunthorpe wrote: > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote: > > > There's no DMA involved with this BAR-- the driver writes a portion > > of the > > packet contents in addition to the descriptors, which generally > > increases the > > number of TLPs if write-combine isn't used. Furthermore, this BAR > > is only used > > for writes and never for reads. > > You use DPDK without DMA? How does receive work? > > > As Jason noted in the other reply to this email, the Linux ENA > > driver makes use > > of WC by using devm_ioremap_wc(). > > As Ben noted we don't have kernel accessors to make this portable or > safe :( Well.. to be frank it does work "well enough" for simple cases like frame buffers :-) Cheers Ben.
On Mon, 2020-09-14 at 11:38 -0300, Jason Gunthorpe wrote: > On Mon, Sep 14, 2020 at 02:24:06PM +0000, Clint Sbisa wrote: > > On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote: > > > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote: > > > > > > > There's no DMA involved with this BAR-- the driver writes a > > > > portion of the > > > > packet contents in addition to the descriptors, which generally > > > > increases the > > > > number of TLPs if write-combine isn't used. Furthermore, this > > > > BAR is only used > > > > for writes and never for reads. > > > > > > You use DPDK without DMA? How does receive work? > > > > That was not worded well. DMA is used, but the first X bytes of the > > packet are > > written directly to this BAR instead of being DMA'd-- the rest of > > the data is > > DMA'd. > > which is back to my original question, how do you do DMA using > /sys/xx/resources? Why not use VFIO like everything else? Note: All this doesnt' change the fact that sys/xx/resources_wc exists for other archs and I see no reasons so far not to have it on ARM... Cheers, Ben.
On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > which is back to my original question, how do you do DMA using > > /sys/xx/resources? Why not use VFIO like everything else? > > Note: All this doesnt' change the fact that sys/xx/resources_wc > exists > for other archs and I see no reasons so far not to have it on ARM... Also... it looks like VFIO also doesn't provide a way to do WC yet unfortunately :-( There was a discussion 2 or 3 years ago while I was at IBM about ways to add that functionality, but it doesn't seem to have resulted in anything. Cheers, Ben.
On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > > > which is back to my original question, how do you do DMA using > > > /sys/xx/resources? Why not use VFIO like everything else? > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc > > exists > > for other archs and I see no reasons so far not to have it on ARM... > > Also... it looks like VFIO also doesn't provide a way to do WC yet > unfortunately :-( > > There was a discussion 2 or 3 years ago while I was at IBM about ways > to add that functionality, but it doesn't seem to have resulted in > anything. And to complete answering the question, the ENA PMD in DPDK also supports vfio, which-- as Ben pointed out-- doesn't support WC either. Clint
On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > > > which is back to my original question, how do you do DMA using > > > /sys/xx/resources? Why not use VFIO like everything else? > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc > > exists > > for other archs and I see no reasons so far not to have it on ARM... > > Also... it looks like VFIO also doesn't provide a way to do WC yet > unfortunately :-( Yes, but if the driving reason for this patch is because a VFIO user like EFA DPDK is trying to work around VFIO limitations, then I'd say the VFIO mmap should be amended, and not so much worring about sysfs. While there is no reason for ARM to not show the sysfs, it really should never be used. Modern kernels in secure boot don't even show it, for instance. Jason
On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote: > On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote: > > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > > > > > which is back to my original question, how do you do DMA using > > > > /sys/xx/resources? Why not use VFIO like everything else? > > > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc > > > exists > > > for other archs and I see no reasons so far not to have it on ARM... > > > > Also... it looks like VFIO also doesn't provide a way to do WC yet > > unfortunately :-( > > Yes, but if the driving reason for this patch is because a VFIO user > like EFA DPDK is trying to work around VFIO limitations, then I'd say > the VFIO mmap should be amended, and not so much worring about sysfs. I don't think the two are exclusive. > While there is no reason for ARM to not show the sysfs, it really > should never be used. Modern kernels in secure boot don't even show > it, for instance. It's useful for random things, I've used it quite a bit in a previous life for things like in-lab hw testing etc... There's tooling out there, esp. in the more 'embedded' side of thing that uses this, I don't see a good reason not to provide the same level of functionality. So Lorenzo, imho, we should merge the patch. As for fixing VFIO, definitely something to revive. The main contention point was which "interface" to use to request write combine. Let's restart that conversation with the appropriate folks, the last I remember, the question was to figure out what interface to provide userspace for the functionality. Clint, do you want to drive this as well ? Cheers, Ben.
On Tue, Sep 15, 2020 at 09:25:57AM +1000, Benjamin Herrenschmidt wrote: > On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote: > > On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote: > > > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > > > > > > > which is back to my original question, how do you do DMA using > > > > > /sys/xx/resources? Why not use VFIO like everything else? > > > > > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc > > > > exists > > > > for other archs and I see no reasons so far not to have it on ARM... > > > > > > Also... it looks like VFIO also doesn't provide a way to do WC yet > > > unfortunately :-( > > > > Yes, but if the driving reason for this patch is because a VFIO user > > like EFA DPDK is trying to work around VFIO limitations, then I'd say > > the VFIO mmap should be amended, and not so much worring about sysfs. > > I don't think the two are exclusive. > > > While there is no reason for ARM to not show the sysfs, it really > > should never be used. Modern kernels in secure boot don't even show > > it, for instance. > > It's useful for random things, I've used it quite a bit in a previous > life for things like in-lab hw testing etc... There's tooling out > there, esp. in the more 'embedded' side of thing that uses this, I > don't see a good reason not to provide the same level of functionality. > > So Lorenzo, imho, we should merge the patch. To sum it up: (1) RDMA drivers need a new mapping function/attribute to define their message push model. Actually the message model is not necessarily related to write combining a la x86, so we should probably come up with a better and consistent naming. Enabling this patchset may trigger performance regressions on mellanox drivers on arm64 - this ought to be addressed. (2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since it looks relevant, WC message model semantics should be introduced there, somehow. (3) Given the above, the sysfs interface can be enabled. I still don't see the advantages (other than saying it is there for other arches, ie what can you really do with the sysfs mappings - see Jason's VFIO/DMA query) but we can do it. Still, I am not happy about the possible mellanox regressions - that should be tested/verified before this patch is merged IMHO. Do we really need it for v5.10 ? Thoughts ? Lorenzo > As for fixing VFIO, definitely something to revive. The main contention > point was which "interface" to use to request write combine. > > Let's restart that conversation with the appropriate folks, the last I > remember, the question was to figure out what interface to provide > userspace for the functionality. > > Clint, do you want to drive this as well ? > > Cheers, > Ben. >
On Tue, Sep 15, 2020 at 11:18:31AM +0100, Lorenzo Pieralisi wrote: > On Tue, Sep 15, 2020 at 09:25:57AM +1000, Benjamin Herrenschmidt wrote: > > On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote: > > > > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote: > > > > > > > > > > > which is back to my original question, how do you do DMA using > > > > > > /sys/xx/resources? Why not use VFIO like everything else? > > > > > > > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc > > > > > exists > > > > > for other archs and I see no reasons so far not to have it on ARM... > > > > > > > > Also... it looks like VFIO also doesn't provide a way to do WC yet > > > > unfortunately :-( > > > > > > Yes, but if the driving reason for this patch is because a VFIO user > > > like EFA DPDK is trying to work around VFIO limitations, then I'd say > > > the VFIO mmap should be amended, and not so much worring about sysfs. > > > > I don't think the two are exclusive. > > > > > While there is no reason for ARM to not show the sysfs, it really > > > should never be used. Modern kernels in secure boot don't even show > > > it, for instance. > > > > It's useful for random things, I've used it quite a bit in a previous > > life for things like in-lab hw testing etc... There's tooling out > > there, esp. in the more 'embedded' side of thing that uses this, I > > don't see a good reason not to provide the same level of functionality. > > > > So Lorenzo, imho, we should merge the patch. > > To sum it up: > > (1) RDMA drivers need a new mapping function/attribute to define their > message push model. Actually the message model is not necessarily related > to write combining a la x86, so we should probably come up with a better > and consistent naming. Enabling this patchset may trigger performance > regressions on mellanox drivers on arm64 - this ought to be > addressed. It is pretty clear now that the certain ARM chips that don't do write combining with pgprot_writecombine will performance regress if they are running a certain uncommon Mellanox configuration. I suspect these deployments are all running the out of tree patch for DEVICE_GRE though. So, I wouldn't worry about it - but would prefer to see ARM folks advance some proposal to accommodate those chips that need DEVICE_GRE. Jason
On Tue, 2020-09-15 at 11:18 +0100, Lorenzo Pieralisi wrote: > > It's useful for random things, I've used it quite a bit in a previous > > life for things like in-lab hw testing etc... There's tooling out > > there, esp. in the more 'embedded' side of thing that uses this, I > > don't see a good reason not to provide the same level of functionality. > > > > So Lorenzo, imho, we should merge the patch. > > To sum it up: > > (1) RDMA drivers need a new mapping function/attribute to define their > message push model. Actually the message model is not necessarily related > to write combining a la x86, so we should probably come up with a better > and consistent naming. Enabling this patchset may trigger performance > regressions on mellanox drivers on arm64 - this ought to be addressed. I doubt it. Besides Mellanox will probably enable WC already through the other code path (the use of this accessor is only one of the path that enables the driver to do WC). I don't think we need to solve the RDMA semantics issue that urgently TBH, and it's definitely an orthogonal issue to that at hand. > (2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since > it looks relevant, WC message model semantics should be introduced > there, somehow. Yes. > (3) Given the above, the sysfs interface can be enabled. I still don't > see the advantages (other than saying it is there for other arches, ie > what can you really do with the sysfs mappings - see Jason's VFIO/DMA > query) but we can do it. Still, I am not happy about the possible > mellanox regressions - that should be tested/verified before this > patch is merged IMHO. Do we really need it for v5.10 ? I don't think there's a significant risk of regression, but then I also dont' have a way to test :-) Cheers, Ben.
On Wed, Sep 16, 2020 at 09:00:09AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 11:18 +0100, Lorenzo Pieralisi wrote: > > To sum it up: > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > message push model. Actually the message model is not necessarily related > > to write combining a la x86, so we should probably come up with a better > > and consistent naming. Enabling this patchset may trigger performance > > regressions on mellanox drivers on arm64 - this ought to be addressed. > > I doubt it. Besides Mellanox will probably enable WC already through > the other code path (the use of this accessor is only one of the path > that enables the driver to do WC). > > I don't think we need to solve the RDMA semantics issue that urgently > TBH, and it's definitely an orthogonal issue to that at hand. > > > (2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since > > it looks relevant, WC message model semantics should be introduced > > there, somehow. > > Yes. I will ping some folks on the VFIO patch to see if we can get the ball rolling there again. > > > (3) Given the above, the sysfs interface can be enabled. I still don't > > see the advantages (other than saying it is there for other arches, ie > > what can you really do with the sysfs mappings - see Jason's VFIO/DMA > > query) but we can do it. Still, I am not happy about the possible > > mellanox regressions - that should be tested/verified before this > > patch is merged IMHO. Do we really need it for v5.10 ? > > I don't think there's a significant risk of regression, but then I also > dont' have a way to test :-) I'm going to re-submit this patch to Catalin and Will with an updated commit message capturing the context from this discussion (and cc everyone involved). As for the whole device GRE / new naming context-- I'll have to defer to Lorenzo on suggested next steps on this front. Thanks, Clint
On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote: > > To sum it up: > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > message push model. Actually the message model is not necessarily related > > to write combining a la x86, so we should probably come up with a better > > and consistent naming. Enabling this patchset may trigger performance > > regressions on mellanox drivers on arm64 - this ought to be > > addressed. > > It is pretty clear now that the certain ARM chips that don't do write > combining with pgprot_writecombine will performance regress if they > are running a certain uncommon Mellanox configuration. I suspect these > deployments are all running the out of tree patch for DEVICE_GRE > though. I'm not sure I understand... Today those ARM chips will not use pgprot_writecombine (at least not using that code path, they might still use it as the result of the other path in the driver that can enable it). So they get MT_DEVICE_nGnRnE (unless I missed something here). So they will not combine. With the patch, those device will now use MT_DEVICE_NC. Why would that be a regression ? It will allow speculation, that doesn't necessarily mean that the CPU will cause spurrious accesses, it probably won't in most case... And it should allow combining, no ? BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ? Its not supported on some chips ? Not that this lead me to discover annother weird thing ... What on earth is pgprot_device() ? This is new ? On ARM it will be MT_DEVICE_nGnRE, so it allows posted write. It seems to match what ioremap does. Should then ioremap use it as well ? But it's only ever used for PCI mmap. Why is it different from pgprot_noncached() which disables posted writes (nE) ? Because a whole lot of drivers will use pgprot_noncached() explicitly in either mmap or vmap, with the expectation that it's somewhat the same as what ioremap does... Cheers, Ben.
On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote: > > > To sum it up: > > > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > > message push model. Actually the message model is not necessarily related > > > to write combining a la x86, so we should probably come up with a better > > > and consistent naming. Enabling this patchset may trigger performance > > > regressions on mellanox drivers on arm64 - this ought to be > > > addressed. > > > > It is pretty clear now that the certain ARM chips that don't do write > > combining with pgprot_writecombine will performance regress if they > > are running a certain uncommon Mellanox configuration. I suspect these > > deployments are all running the out of tree patch for DEVICE_GRE > > though. > > I'm not sure I understand... > > Today those ARM chips will not use pgprot_writecombine (at least not > using that code path, they might still use it as the result of the > other path in the driver that can enable it). Not quite, upstream kernel will never use WC on those devices. DEVICE_GRE is not supported in upstream, arch_can_pci_mmap_wc() is always false and the WC tester will always fail. > With the patch, those device will now use MT_DEVICE_NC. Which doesn't do WC at all on some ARM implementations. > Why would that be a regression ? Using the WC submission flow when it doesn't work costs something like 10% performance vs using the non-WC flow. Like I said, the case where the driver can't self test probably doesn't intersect with the ARM implementations that can't do write combining, and if it did, the users probably run the out of tree driver that has the hacky stuff to make it use DEVICE_GRE. > BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ? > Its not supported on some chips ? It has alignment requirements drivers don't meet. We need a new concept of "write combining and I promise to do aligned access" > What on earth is pgprot_device() ? This is new ? On ARM it will be > MT_DEVICE_nGnRE, so it allows posted write. It seems to match what > ioremap does. Should then ioremap use it as well ? > > But it's only ever used for PCI mmap. Why is it different from > pgprot_noncached() which disables posted writes (nE) ? > > Because a whole lot of drivers will use pgprot_noncached() explicitly > in either mmap or vmap, with the expectation that it's somewhat the > same as what ioremap does... *boggle* Only sysfs uses pci_mmap_resource_range() any other driver exposing BAR pages, like VFIO dies not. Makes no sense at all it is different. Delete the ill defined pgprot_device() ? Nobody has complained something is wrong with VFIO in the 6 years since it was added... Jason
On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote: > Not quite, upstream kernel will never use WC on those > devices. DEVICE_GRE is not supported in upstream, > arch_can_pci_mmap_wc() is always false and the WC tester will always > fail. > > > With the patch, those device will now use MT_DEVICE_NC. > > Which doesn't do WC at all on some ARM implementations. Lovely... this is arm64 btw, still the case ? Also we could make this a variable rather than a constant and choose a more appropriate set of flags at boot time.... > > Why would that be a regression ? > > Using the WC submission flow when it doesn't work costs something like > 10% performance vs using the non-WC flow. You mean the driver uses a different path to the HW which ahs that overhead, not that MMIOs have that overhead right ? > Like I said, the case where the driver can't self test probably > doesn't intersect with the ARM implementations that can't do write > combining, and if it did, the users probably run the out of tree > driver that has the hacky stuff to make it use DEVICE_GRE. Ok. So you are saying to go for it and ignore that Mellanox case then ? :-) > > BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ? > > Its not supported on some chips ? > > It has alignment requirements drivers don't meet. We need a new > concept of "write combining and I promise to do aligned access" Ah yes, I remember. Right, we would need to provide new/better accessors for these kind of things. It's going to be a mess to find a common set that works for all archs. > > What on earth is pgprot_device() ? This is new ? On ARM it will be > > MT_DEVICE_nGnRE, so it allows posted write. It seems to match what > > ioremap does. Should then ioremap use it as well ? > > > > But it's only ever used for PCI mmap. Why is it different from > > pgprot_noncached() which disables posted writes (nE) ? > > > > Because a whole lot of drivers will use pgprot_noncached() explicitly > > in either mmap or vmap, with the expectation that it's somewhat the > > same as what ioremap does... > > *boggle* > > Only sysfs uses pci_mmap_resource_range() any other driver exposing > BAR pages, like VFIO dies not. Makes no sense at all it is different. > > Delete the ill defined pgprot_device() ? Nobody has complained > something is wrong with VFIO in the 6 years since it was added... I was wondering what it was, that's it ... Cheers, Ben.
On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote: > > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote: > > > > To sum it up: > > > > > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > > > message push model. Actually the message model is not necessarily related > > > > to write combining a la x86, so we should probably come up with a better > > > > and consistent naming. Enabling this patchset may trigger performance > > > > regressions on mellanox drivers on arm64 - this ought to be > > > > addressed. > > > > > > It is pretty clear now that the certain ARM chips that don't do write > > > combining with pgprot_writecombine will performance regress if they > > > are running a certain uncommon Mellanox configuration. I suspect these > > > deployments are all running the out of tree patch for DEVICE_GRE > > > though. > > > > I'm not sure I understand... > > > > Today those ARM chips will not use pgprot_writecombine (at least not > > using that code path, they might still use it as the result of the > > other path in the driver that can enable it). > > Not quite, upstream kernel will never use WC on those > devices. DEVICE_GRE is not supported in upstream, > arch_can_pci_mmap_wc() is always false and the WC tester will always > fail. > > > With the patch, those device will now use MT_DEVICE_NC. > > Which doesn't do WC at all on some ARM implementations. Is that just TX2? I remember that thing being weird where GRE performed better than NC, but I thought that was a one off (and the thing is dead). NC is more permissive than GRE, so I think that's the right one to use; i.e. we go for the fewest number of restrictions on the hardware. If somebody screws up the uarch, that's up to them. Will
On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote: > On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote: > > > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote: > > > > > To sum it up: > > > > > > > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > > > > message push model. Actually the message model is not necessarily related > > > > > to write combining a la x86, so we should probably come up with a better > > > > > and consistent naming. Enabling this patchset may trigger performance > > > > > regressions on mellanox drivers on arm64 - this ought to be > > > > > addressed. > > > > > > > > It is pretty clear now that the certain ARM chips that don't do write > > > > combining with pgprot_writecombine will performance regress if they > > > > are running a certain uncommon Mellanox configuration. I suspect these > > > > deployments are all running the out of tree patch for DEVICE_GRE > > > > though. > > > > > > I'm not sure I understand... > > > > > > Today those ARM chips will not use pgprot_writecombine (at least not > > > using that code path, they might still use it as the result of the > > > other path in the driver that can enable it). > > > > Not quite, upstream kernel will never use WC on those > > devices. DEVICE_GRE is not supported in upstream, > > arch_can_pci_mmap_wc() is always false and the WC tester will always > > fail. > > > > > With the patch, those device will now use MT_DEVICE_NC. > > > > Which doesn't do WC at all on some ARM implementations. > > Is that just TX2? I remember that thing being weird where GRE performed > better than NC, but I thought that was a one off (and the thing is dead). I recall something along these lines. Hopefully ARM updated the guidance to licensees. > NC is more permissive than GRE, so I think that's the right one to use; i.e. > we go for the fewest number of restrictions on the hardware. If somebody > screws up the uarch, that's up to them. I agree, Normal NC is better as long as the BAR can tolerate read side-effects.
On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote: > Is that just TX2? I remember that thing being weird where GRE performed > better than NC, but I thought that was a one off (and the thing is dead). It is at least TX2, yes, but it is not really dead, someone made a respectable supercomputer out of them that still has an operational life ahead of it. Jason
On Wed, Sep 16, 2020 at 05:59:24PM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote: > > Not quite, upstream kernel will never use WC on those > > devices. DEVICE_GRE is not supported in upstream, > > arch_can_pci_mmap_wc() is always false and the WC tester will always > > fail. > > > > > With the patch, those device will now use MT_DEVICE_NC. > > > > Which doesn't do WC at all on some ARM implementations. > > Lovely... this is arm64 btw, still the case ? Yep > Also we could make this a variable rather than a constant and choose > a more appropriate set of flags at boot time.... It is a function, so it could check the CPU ID for the known broken devices and block them. > > > Why would that be a regression ? > > > > Using the WC submission flow when it doesn't work costs something like > > 10% performance vs using the non-WC flow. > > You mean the driver uses a different path to the HW which ahs that > overhead, not that MMIOs have that overhead right ? The different path has overhead of doing extra useless MMIOs because they don't combine Jason
On Wed, Sep 16, 2020 at 05:59:24PM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote: > > > Like I said, the case where the driver can't self test probably > > doesn't intersect with the ARM implementations that can't do write > > combining, and if it did, the users probably run the out of tree > > driver that has the hacky stuff to make it use DEVICE_GRE. > > Ok. So you are saying to go for it and ignore that Mellanox case then ? > :-) Regarding Mellanox, as I wrote it in the beginning of this thread, you can ignore mlx5 driver. https://lore.kernel.org/linux-pci/20200910105419.GH421756@unreal/ Thanks
On Wed, Sep 16, 2020 at 09:12:26AM -0300, Jason Gunthorpe wrote: [...] > > You mean the driver uses a different path to the HW which ahs that > > overhead, not that MMIOs have that overhead right ? > > The different path has overhead of doing extra useless MMIOs because > they don't combine For my own information, this is IB user space driver code, correct ? It tries to mmap buffer as WC and if it succeeds write into it in an optimized fashion (that is just pure overhead on platforms where normal NC memory - ie WC on arm64 - does not do what the _architecture_ defines it should). Lorenzo
On Wed, Sep 16, 2020 at 03:09:14PM +0100, Lorenzo Pieralisi wrote: > On Wed, Sep 16, 2020 at 09:12:26AM -0300, Jason Gunthorpe wrote: > > [...] > > > > You mean the driver uses a different path to the HW which ahs that > > > overhead, not that MMIOs have that overhead right ? > > > > The different path has overhead of doing extra useless MMIOs because > > they don't combine > > For my own information, this is IB user space driver code, correct ? Yes, maybe DPDK too > It tries to mmap buffer as WC and if it succeeds write into it in an > optimized fashion (that is just pure overhead on platforms where > normal NC memory - ie WC on arm64 - does not do what the > _architecture_ defines it should). Right, pure overhead if large PCI-E TLPs are not delivered to the device. Jason
On Wed, Sep 16, 2020 at 09:48:52AM +0100, Catalin Marinas wrote: > On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote: > > On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote: > > > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote: > > > > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote: > > > > > > To sum it up: > > > > > > > > > > > > (1) RDMA drivers need a new mapping function/attribute to define their > > > > > > message push model. Actually the message model is not necessarily related > > > > > > to write combining a la x86, so we should probably come up with a better > > > > > > and consistent naming. Enabling this patchset may trigger performance > > > > > > regressions on mellanox drivers on arm64 - this ought to be > > > > > > addressed. > > > > > > > > > > It is pretty clear now that the certain ARM chips that don't do write > > > > > combining with pgprot_writecombine will performance regress if they > > > > > are running a certain uncommon Mellanox configuration. I suspect these > > > > > deployments are all running the out of tree patch for DEVICE_GRE > > > > > though. > > > > > > > > I'm not sure I understand... > > > > > > > > Today those ARM chips will not use pgprot_writecombine (at least not > > > > using that code path, they might still use it as the result of the > > > > other path in the driver that can enable it). > > > > > > Not quite, upstream kernel will never use WC on those > > > devices. DEVICE_GRE is not supported in upstream, > > > arch_can_pci_mmap_wc() is always false and the WC tester will always > > > fail. > > > > > > > With the patch, those device will now use MT_DEVICE_NC. > > > > > > Which doesn't do WC at all on some ARM implementations. > > > > Is that just TX2? I remember that thing being weird where GRE performed > > better than NC, but I thought that was a one off (and the thing is dead). > > I recall something along these lines. Hopefully ARM updated the guidance > to licensees. > > > NC is more permissive than GRE, so I think that's the right one to use; i.e. > > we go for the fewest number of restrictions on the hardware. If somebody > > screws up the uarch, that's up to them. > > I agree, Normal NC is better as long as the BAR can tolerate read > side-effects. That we don't know but if a prefetchable BAR can't tolerate read side effects this would be already a problem on eg x86 - that's the best we can hope for given the current PCI specs. +1 on normal NC. The only open point is whether we should make arch_can_pci_mmap_wc() return false on platforms like TX2. Thanks, Lorenzo
On Wed, Sep 16, 2020 at 03:15:02PM +0100, Lorenzo Pieralisi wrote: > On Wed, Sep 16, 2020 at 09:48:52AM +0100, Catalin Marinas wrote: > > On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote: > > > On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote: > > > > > With the patch, those device will now use MT_DEVICE_NC. > > > > > > > > Which doesn't do WC at all on some ARM implementations. > > > > > > Is that just TX2? I remember that thing being weird where GRE performed > > > better than NC, but I thought that was a one off (and the thing is dead). > > > > I recall something along these lines. Hopefully ARM updated the guidance > > to licensees. > > > > > NC is more permissive than GRE, so I think that's the right one to use; i.e. > > > we go for the fewest number of restrictions on the hardware. If somebody > > > screws up the uarch, that's up to them. > > > > I agree, Normal NC is better as long as the BAR can tolerate read > > side-effects. > > That we don't know but if a prefetchable BAR can't tolerate read > side effects this would be already a problem on eg x86 - that's > the best we can hope for given the current PCI specs. > > +1 on normal NC. The only open point is whether we should make > arch_can_pci_mmap_wc() return false on platforms like TX2. I lost track in this thread whether it matters. TX2 would need Device GRE for optimal performance but the kernel doesn't currently provide it anyway. We could expose a new memory type, aligned_wc ;).
On Wed, 2020-09-16 at 18:00 +0100, Catalin Marinas wrote: > > That we don't know but if a prefetchable BAR can't tolerate read > > side effects this would be already a problem on eg x86 - that's > > the best we can hope for given the current PCI specs. > > > > +1 on normal NC. The only open point is whether we should make > > arch_can_pci_mmap_wc() return false on platforms like TX2. > > I lost track in this thread whether it matters. TX2 would need Device > GRE for optimal performance but the kernel doesn't currently provide > it anyway. We could expose a new memory type, aligned_wc ;). Or ignore TX2 :-) Though Lorenzo has a point about making it return false for arch_can_pci_mmap_wc() if we really care enough. Cheers, Ben.
On Wed, 2020-09-16 at 09:12 -0300, Jason Gunthorpe wrote: > > Also we could make this a variable rather than a constant and > > choose > > a more appropriate set of flags at boot time.... > > It is a function, so it could check the CPU ID for the known broken > devices and block them. Sure, I meant in the abstract way. It's not a hot path so it doesnt have to be a static key. > > > > Why would that be a regression ? > > > > > > Using the WC submission flow when it doesn't work costs something > > > like > > > 10% performance vs using the non-WC flow. > > > > You mean the driver uses a different path to the HW which ahs that > > overhead, not that MMIOs have that overhead right ? > > The different path has overhead of doing extra useless MMIOs because > they don't combine I see. This might have to end up being a TX2 specific hack until the end of times... Cheers, Ben.
On Thu, Sep 17, 2020 at 09:59:28AM +1000, Benjamin Herrenschmidt wrote: > On Wed, 2020-09-16 at 09:12 -0300, Jason Gunthorpe wrote: > > > Also we could make this a variable rather than a constant and > > > choose > > > a more appropriate set of flags at boot time.... > > > > It is a function, so it could check the CPU ID for the known broken > > devices and block them. > > Sure, I meant in the abstract way. It's not a hot path so it doesnt > have to be a static key. > > > > > > Why would that be a regression ? > > > > > > > > Using the WC submission flow when it doesn't work costs something > > > > like > > > > 10% performance vs using the non-WC flow. > > > > > > You mean the driver uses a different path to the HW which ahs that > > > overhead, not that MMIOs have that overhead right ? > > > > The different path has overhead of doing extra useless MMIOs because > > they don't combine > > I see. This might have to end up being a TX2 specific hack until the > end of times... True - hopefully on platforms that implement normal NC the architectural way will not trigger user space performance regressions. Unfortunately if we merge this patch we _do_ know from this thread that userspace will suffer from a perf regression on TX2. Either we ignore it or we write some code to prevent it (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 - possibly using the arm64 errata detection mechanism). Adding a new IO mapping API and use it in IB drivers won't solve the TX2 problem - since we still prefer normal NC over device GRE for "WC" mappings and we would have to "downgrade" TX2 somehow. Lorenzo
On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote: > Unfortunately if we merge this patch we _do_ know from this thread > that userspace will suffer from a perf regression on TX2. > > Either we ignore it or we write some code to prevent it > (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 - > possibly using the arm64 errata detection mechanism). If anyone complains they can send the change to arch_can_pci_mmap_wc() - I'm pretty sure nobody will complain due to mlx5 Jason
On Thu, Sep 17, 2020 at 08:32:21AM -0300, Jason Gunthorpe wrote: > On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote: > > Unfortunately if we merge this patch we _do_ know from this thread > > that userspace will suffer from a perf regression on TX2. > > > > Either we ignore it or we write some code to prevent it > > (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 - > > possibly using the arm64 errata detection mechanism). > > If anyone complains they can send the change to arch_can_pci_mmap_wc() > - I'm pretty sure nobody will complain due to mlx5 If that's the case we can go ahead and merge this patch with a reworded commit log - this thread was very useful nonetheless for my own information (and others hopefully) so thank you. For VFIO WC + "message push ioremap" - to be continued. Clint, please reword the commit and resend, not sure we can hit v5.10 but we shall try. Thanks, Lorenzo
On Thu, Sep 17, 2020 at 03:01:16PM +0100, Lorenzo Pieralisi wrote: > On Thu, Sep 17, 2020 at 08:32:21AM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote: > > > Unfortunately if we merge this patch we _do_ know from this thread > > > that userspace will suffer from a perf regression on TX2. > > > > > > Either we ignore it or we write some code to prevent it > > > (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 - > > > possibly using the arm64 errata detection mechanism). > > > > If anyone complains they can send the change to arch_can_pci_mmap_wc() > > - I'm pretty sure nobody will complain due to mlx5 > > If that's the case we can go ahead and merge this patch with a > reworded commit log - this thread was very useful nonetheless > for my own information (and others hopefully) so thank you. > > For VFIO WC + "message push ioremap" - to be continued. > > Clint, please reword the commit and resend, not sure we can hit > v5.10 but we shall try. If you ack it, I'll queue it ;) Will
diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h index 70b323cf8300..b33ca260e3c9 100644 --- a/arch/arm64/include/asm/pci.h +++ b/arch/arm64/include/asm/pci.h @@ -17,6 +17,7 @@ #define pcibios_assign_all_busses() \ (pci_has_flag(PCI_REASSIGN_ALL_BUS)) +#define arch_can_pci_mmap_wc() 1 #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1 extern int isa_dma_bridge_buggy;
Using write-combine is crucial for performance of PCI devices where significant amounts of transactions go over PCI BARs. arm64 supports write-combine PCI mappings, so the appropriate define has been added which will expose write-combine mappings under sysfs for prefetchable PCI resources. Signed-off-by: Clint Sbisa <csbisa@amazon.com> --- arch/arm64/include/asm/pci.h | 1 + 1 file changed, 1 insertion(+)