mbox series

[rdma-next,00/10] Enable relaxed ordering for ULPs

Message ID 20210405052404.213889-1-leon@kernel.org (mailing list archive)
Headers show
Series Enable relaxed ordering for ULPs | expand

Message

Leon Romanovsky April 5, 2021, 5:23 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on a NVIDIA A100 in order
to check performance of storage infrastructure over xprtrdma:

Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec

With relaxed ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec

Thanks

Avihai Horon (10):
  RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  RDMA/iser: Enable Relaxed Ordering
  RDMA/rtrs: Enable Relaxed Ordering
  RDMA/srp: Enable Relaxed Ordering
  nvme-rdma: Enable Relaxed Ordering
  cifs: smbd: Enable Relaxed Ordering
  net/rds: Enable Relaxed Ordering
  net/smc: Enable Relaxed Ordering
  xprtrdma: Enable Relaxed Ordering

 drivers/infiniband/core/mr_pool.c             |  7 +-
 drivers/infiniband/core/rw.c                  | 12 ++--
 drivers/infiniband/core/verbs.c               | 26 +++++--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h      |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h        |  2 +-
 drivers/infiniband/hw/cxgb4/mem.c             |  2 +-
 drivers/infiniband/hw/hns/hns_roce_device.h   |  2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c       |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c     |  3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h          |  2 +-
 drivers/infiniband/hw/mlx4/mr.c               |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c               | 61 ++++++++--------
 drivers/infiniband/hw/mlx5/wr.c               | 69 ++++++++++++++-----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
 drivers/infiniband/hw/qedr/verbs.c            |  2 +-
 drivers/infiniband/hw/qedr/verbs.h            |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  4 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c             |  3 +-
 drivers/infiniband/sw/rdmavt/mr.h             |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.h         |  2 +-
 drivers/infiniband/ulp/iser/iser_memory.c     | 10 ++-
 drivers/infiniband/ulp/iser/iser_verbs.c      |  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c        |  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-srv.c        | 15 ++--
 drivers/infiniband/ulp/srp/ib_srp.c           |  8 +--
 drivers/nvme/host/rdma.c                      | 19 +++--
 fs/cifs/smbdirect.c                           | 17 +++--
 include/rdma/ib_verbs.h                       | 11 ++-
 include/rdma/mr_pool.h                        |  3 +-
 net/rds/ib_frmr.c                             |  7 +-
 net/smc/smc_ib.c                              |  3 +-
 net/smc/smc_wr.c                              |  3 +-
 net/sunrpc/xprtrdma/frwr_ops.c                | 10 +--
 39 files changed, 209 insertions(+), 140 deletions(-)

Comments

Christoph Hellwig April 5, 2021, 1:41 p.m. UTC | #1
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> >From Avihai,
> 
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
> 
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
> 
> The following test results show the performance improvement achieved
> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> to check performance of storage infrastructure over xprtrdma:

Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
What does that have to do with storage protocols?

Also if you enable this for basically all kernel ULPs, why not have
an opt-out into strict ordering for the cases that need it (if there are
any).
Leon Romanovsky April 5, 2021, 2:08 p.m. UTC | #2
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > >From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

This system is in use by our storage oriented customer who performed the
test. He runs drivers/infiniband/* stack from the upstream, simply backported
to specific kernel version.

The performance boost is seen in other systems too.

> 
> Also if you enable this for basically all kernel ULPs, why not have
> an opt-out into strict ordering for the cases that need it (if there are
> any).

The RO property is optional, it can only improve. In addition, all in-kernel ULPs
don't need strict ordering. I can be mistaken here and Jason will correct me, it
is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.

Thanks
Santosh Shilimkar April 5, 2021, 4:11 p.m. UTC | #3
> On Apr 5, 2021, at 7:08 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>> 
>>>> From Avihai,
>>> 
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>> 
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>> 
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>> 
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
> 
> The performance boost is seen in other systems too.
> 
>> 
>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
> 
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
> 

Sticking to in-kernel ULP’s don’t need strict ordering, why don’t you enable this
for HCA’s which supports it by default instead of patching very ULPs. Some ULPs
in future may forget to specify such flag.

Regards,
Santosh
Tom Talpey April 5, 2021, 5:54 p.m. UTC | #4
On 4/5/2021 10:08 AM, Leon Romanovsky wrote:
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> >From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
> 
> The performance boost is seen in other systems too.

We need to see more information about this test, and platform.

What correctness testing was done, and how was it verified? What
PCI bus type(s) were tested, and with what adapters? What storage
workload was generated, and were all possible RDMA exchanges by
each ULP exercised?

>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
> 
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.

+1 on Christoph's comment.

I woud hope most well-architected ULPs will support relaxed ordering,
but storage workloads, in my experience, can find ways to cause failure
in adapters. I would not suggest making this the default behavior
without extensive testing.

Tom.
Jason Gunthorpe April 5, 2021, 8:07 p.m. UTC | #5
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > >From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

I think it is a typo (or at least mit makes no sense to be talking
about NFS with a GPU chip) Probably it should be a DGX A100 which is a
dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
workload.

AMD dual socket systems are well known to benefit from relaxed
ordering, people have been doing this in userspace for a while now
with the opt in.

What surprises me is the performance difference, I hadn't heard it is
4x!

Jason
Chuck Lever III April 5, 2021, 11:42 p.m. UTC | #6
> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>> 
>>>> From Avihai,
>>> 
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>> 
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>> 
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>> 
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> I think it is a typo (or at least mit makes no sense to be talking
> about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> workload.

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

I have an old Haswell dual-socket system in my lab, but otherwise
I'm not sure I have a platform that would be interesting for such a
test.


> AMD dual socket systems are well known to benefit from relaxed
> ordering, people have been doing this in userspace for a while now
> with the opt in.


--
Chuck Lever
Keith Busch April 5, 2021, 11:50 p.m. UTC | #7
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>> 
> >>>> From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

Not sure if Haswell will be useful for such testing. It looks like many
of those subscribe to 'quirk_relaxedordering_disable'.
Honggang LI April 6, 2021, 2:37 a.m. UTC | #8
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> From Avihai,
> 
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
> 
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
> 
> The following test results show the performance improvement achieved

Did you test this patchset with CPU does not support relaxed ordering?

We observed significantly performance degradation when run perftest with
relaxed ordering enabled over old CPU.

https://github.com/linux-rdma/perftest/issues/116

thanks
Leon Romanovsky April 6, 2021, 5:09 a.m. UTC | #9
On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> 
> Did you test this patchset with CPU does not support relaxed ordering?

I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
and they are not interesting from performance point of view.

> 
> We observed significantly performance degradation when run perftest with
> relaxed ordering enabled over old CPU.
> 
> https://github.com/linux-rdma/perftest/issues/116

The perftest is slightly different, but you pointed to the valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
and arguably this was needed to be done in perftest too.

Thanks

> 
> thanks
>
Leon Romanovsky April 6, 2021, 5:12 a.m. UTC | #10
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> 
> 
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>> 
> >>>> From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

I will ask to provide more details.

> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

We don't have such old systems too.

> 
> 
> > AMD dual socket systems are well known to benefit from relaxed
> > ordering, people have been doing this in userspace for a while now
> > with the opt in.
> 
> 
> --
> Chuck Lever
> 
> 
>
Jason Gunthorpe April 6, 2021, 11:49 a.m. UTC | #11
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.

On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

Still, obviously people should test on the platforms they have.

Jason
Jason Gunthorpe April 6, 2021, 11:53 a.m. UTC | #12
On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > From Avihai,
> > > 
> > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > > imposed on PCI transactions, and thus, can improve performance.
> > > 
> > > Until now, relaxed ordering could be set only by user space applications
> > > for user MRs. The following patch series enables relaxed ordering for the
> > > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > > such, it is ignored by vendors that don't support it.
> > > 
> > > The following test results show the performance improvement achieved
> > 
> > Did you test this patchset with CPU does not support relaxed ordering?
> 
> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
> and they are not interesting from performance point of view.
> 
> > 
> > We observed significantly performance degradation when run perftest with
> > relaxed ordering enabled over old CPU.
> > 
> > https://github.com/linux-rdma/perftest/issues/116
> 
> The perftest is slightly different, but you pointed to the valid point.
> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
> and arguably this was needed to be done in perftest too.

No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.

There should be no performance difference from asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
not asking for it at all.

Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.

This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform specific PCI tunable people have to set.

I thought the old haswell systems were quirked to disable RO globally
anyhow?

Jason
Tom Talpey April 9, 2021, 2:26 p.m. UTC | #13
On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>   
>> We need to get a better idea what correctness testing has been done,
>> and whether positive correctness testing results can be replicated
>> on a variety of platforms.
> 
> RO has been rolling out slowly on mlx5 over a few years and storage
> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> turned on for a long time, userspace HPC applications have been using
> it for a while now too.

I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.

> We know there are platforms with broken RO implementations (like
> Haswell) but the kernel is supposed to globally turn off RO on all
> those cases. I'd be a bit surprised if we discover any more from this
> series.
> 
> On the other hand there are platforms that get huge speed ups from
> turning this on, AMD is one example, there are a bunch in the ARM
> world too.

My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which explicitly). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.

The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.

> Still, obviously people should test on the platforms they have.

Yes, and "test" be taken seriously with focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.

Tom.
Chuck Lever III April 9, 2021, 2:45 p.m. UTC | #14
> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>  
>>> We need to get a better idea what correctness testing has been done,
>>> and whether positive correctness testing results can be replicated
>>> on a variety of platforms.
>> RO has been rolling out slowly on mlx5 over a few years and storage
>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>> turned on for a long time, userspace HPC applications have been using
>> it for a while now too.
> 
> I'd love to see RO be used more, it was always something the RDMA
> specs supported and carefully architected for. My only concern is
> that it's difficult to get right, especially when the platforms
> have been running strictly-ordered for so long. The ULPs need
> testing, and a lot of it.
> 
>> We know there are platforms with broken RO implementations (like
>> Haswell) but the kernel is supposed to globally turn off RO on all
>> those cases. I'd be a bit surprised if we discover any more from this
>> series.
>> On the other hand there are platforms that get huge speed ups from
>> turning this on, AMD is one example, there are a bunch in the ARM
>> world too.
> 
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly). The RO pipeline will completely reorder
> DMA writes, and consumers which infer ordering from memory contents may
> break. This can even apply within the provider code, which may attempt
> to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


> The Mellanox adapter, itself, historically has strict in-order DMA
> semantics, and while it's great to relax that, changing it by default
> for all consumers is something to consider very cautiously.
> 
>> Still, obviously people should test on the platforms they have.
> 
> Yes, and "test" be taken seriously with focus on ULP data integrity.
> Speedups will mean nothing if the data is ever damaged.

I agree that data integrity comes first.

Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.


--
Chuck Lever
Tom Talpey April 9, 2021, 3:32 p.m. UTC | #15
On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> 
> 
>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>   
>>>> We need to get a better idea what correctness testing has been done,
>>>> and whether positive correctness testing results can be replicated
>>>> on a variety of platforms.
>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>> turned on for a long time, userspace HPC applications have been using
>>> it for a while now too.
>>
>> I'd love to see RO be used more, it was always something the RDMA
>> specs supported and carefully architected for. My only concern is
>> that it's difficult to get right, especially when the platforms
>> have been running strictly-ordered for so long. The ULPs need
>> testing, and a lot of it.
>>
>>> We know there are platforms with broken RO implementations (like
>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>> those cases. I'd be a bit surprised if we discover any more from this
>>> series.
>>> On the other hand there are platforms that get huge speed ups from
>>> turning this on, AMD is one example, there are a bunch in the ARM
>>> world too.
>>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly). The RO pipeline will completely reorder
>> DMA writes, and consumers which infer ordering from memory contents may
>> break. This can even apply within the provider code, which may attempt
>> to poll WR and CQ structures, and be tripped up.
> 
> You are referring specifically to RPC/RDMA depending on Receive
> completions to guarantee that previous RDMA Writes have been
> retired? Or is there a particular implementation practice in
> the Linux RPC/RDMA code that worries you?

Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
	PCIe RO Write for data
	PCIe RO Write for data
	...
RDMA Write arrives
	PCIe RO Write for data
	...
RDMA Send arrives
	PCIe RO Write for receive data
	PCIe RO Write for receive descriptor
	PCIe interrupt (flushes RO pipeline for all three ops above)

RPC/RDMA polls CQ
	Reaps receive completion

RDMA Send arrives
	PCIe RO Write for receive data
	PCIe RO write for receive descriptor
	Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
	Reaps receive completion
	PCIe RO writes not yet flushed
	Processes incomplete in-memory data
	Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.

Tom.

>> The Mellanox adapter, itself, historically has strict in-order DMA
>> semantics, and while it's great to relax that, changing it by default
>> for all consumers is something to consider very cautiously.
>>
>>> Still, obviously people should test on the platforms they have.
>>
>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>> Speedups will mean nothing if the data is ever damaged.
> 
> I agree that data integrity comes first.
> 
> Since I currently don't have facilities to test RO in my lab, the
> community will have to agree on a set of tests and expected results
> that specifically exercise the corner cases you are concerned about.
> 
> 
> --
> Chuck Lever
> 
> 
> 
>
Haakon Bugge April 9, 2021, 4:27 p.m. UTC | #16
> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
> 
> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>  
>>>>> We need to get a better idea what correctness testing has been done,
>>>>> and whether positive correctness testing results can be replicated
>>>>> on a variety of platforms.
>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>> turned on for a long time, userspace HPC applications have been using
>>>> it for a while now too.
>>> 
>>> I'd love to see RO be used more, it was always something the RDMA
>>> specs supported and carefully architected for. My only concern is
>>> that it's difficult to get right, especially when the platforms
>>> have been running strictly-ordered for so long. The ULPs need
>>> testing, and a lot of it.
>>> 
>>>> We know there are platforms with broken RO implementations (like
>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>> series.
>>>> On the other hand there are platforms that get huge speed ups from
>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>> world too.
>>> 
>>> My belief is that the biggest risk is from situations where completions
>>> are batched, and therefore polling is used to detect them without
>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>> DMA writes, and consumers which infer ordering from memory contents may
>>> break. This can even apply within the provider code, which may attempt
>>> to poll WR and CQ structures, and be tripped up.
>> You are referring specifically to RPC/RDMA depending on Receive
>> completions to guarantee that previous RDMA Writes have been
>> retired? Or is there a particular implementation practice in
>> the Linux RPC/RDMA code that worries you?
> 
> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> is hopefully unfounded, is that the RO pipeline might not have flushed
> when a completion is posted *after* posting an interrupt.
> 
> Something like this...
> 
> RDMA Write arrives
> 	PCIe RO Write for data
> 	PCIe RO Write for data
> 	...
> RDMA Write arrives
> 	PCIe RO Write for data
> 	...
> RDMA Send arrives
> 	PCIe RO Write for receive data
> 	PCIe RO Write for receive descriptor

Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.



> 	PCIe interrupt (flushes RO pipeline for all three ops above)

Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.

And the MSI-X write likewise, to avoid spurious interrupts.



Thxs, Håkon

> 
> RPC/RDMA polls CQ
> 	Reaps receive completion
> 
> RDMA Send arrives
> 	PCIe RO Write for receive data
> 	PCIe RO write for receive descriptor
> 	Does *not* interrupt, since CQ not armed
> 
> RPC/RDMA continues to poll CQ
> 	Reaps receive completion
> 	PCIe RO writes not yet flushed
> 	Processes incomplete in-memory data
> 	Bzzzt
> 
> Hopefully, the adapter performs a PCIe flushing read, or something
> to avoid this when an interrupt is not generated. Alternatively, I'm
> overly paranoid.
> 
> Tom.
> 
>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>> semantics, and while it's great to relax that, changing it by default
>>> for all consumers is something to consider very cautiously.
>>> 
>>>> Still, obviously people should test on the platforms they have.
>>> 
>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>> Speedups will mean nothing if the data is ever damaged.
>> I agree that data integrity comes first.
>> Since I currently don't have facilities to test RO in my lab, the
>> community will have to agree on a set of tests and expected results
>> that specifically exercise the corner cases you are concerned about.
>> --
>> Chuck Lever
Jason Gunthorpe April 9, 2021, 4:40 p.m. UTC | #17
On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:

> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly).

We don't do this in the kernel.

All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.

ULPs have to be designed this way to use the DMA API properly.
Fencing a DMA before it is completed by the HW will cause IOMMU
errors.

Userspace is a different story, but that will remain as-is with
optional relaxed ordering.

Jason
Tom Talpey April 9, 2021, 5:44 p.m. UTC | #18
On 4/9/2021 12:40 PM, Jason Gunthorpe wrote:
> On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:
> 
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly).
> 
> We don't do this in the kernel.
> 
> All kernel ULPs only read data after they observe the CQE. We do not
> have "last data polling" and our interrupt model does not support some
> hacky "interrupt means go and use the data" approach.
> 
> ULPs have to be designed this way to use the DMA API properly.

Yep. Totally agree.

My concern was about the data being written as relaxed, and the CQE
racing it. I'll reply in the other fork.


> Fencing a DMA before it is completed by the HW will cause IOMMU
> errors.
> 
> Userspace is a different story, but that will remain as-is with
> optional relaxed ordering.
> 
> Jason
>
Tom Talpey April 9, 2021, 5:49 p.m. UTC | #19
On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> 
> 
>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>   
>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>> and whether positive correctness testing results can be replicated
>>>>>> on a variety of platforms.
>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>> turned on for a long time, userspace HPC applications have been using
>>>>> it for a while now too.
>>>>
>>>> I'd love to see RO be used more, it was always something the RDMA
>>>> specs supported and carefully architected for. My only concern is
>>>> that it's difficult to get right, especially when the platforms
>>>> have been running strictly-ordered for so long. The ULPs need
>>>> testing, and a lot of it.
>>>>
>>>>> We know there are platforms with broken RO implementations (like
>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>> series.
>>>>> On the other hand there are platforms that get huge speed ups from
>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>> world too.
>>>>
>>>> My belief is that the biggest risk is from situations where completions
>>>> are batched, and therefore polling is used to detect them without
>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>> break. This can even apply within the provider code, which may attempt
>>>> to poll WR and CQ structures, and be tripped up.
>>> You are referring specifically to RPC/RDMA depending on Receive
>>> completions to guarantee that previous RDMA Writes have been
>>> retired? Or is there a particular implementation practice in
>>> the Linux RPC/RDMA code that worries you?
>>
>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>> is hopefully unfounded, is that the RO pipeline might not have flushed
>> when a completion is posted *after* posting an interrupt.
>>
>> Something like this...
>>
>> RDMA Write arrives
>> 	PCIe RO Write for data
>> 	PCIe RO Write for data
>> 	...
>> RDMA Write arrives
>> 	PCIe RO Write for data
>> 	...
>> RDMA Send arrives
>> 	PCIe RO Write for receive data
>> 	PCIe RO Write for receive descriptor
> 
> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!

>> 	PCIe interrupt (flushes RO pipeline for all three ops above)
> 
> Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.
> 
> And the MSI-X write likewise, to avoid spurious interrupts.

Ok, and yes agreed the same principle would apply.

Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?

Tom.

> Thxs, Håkon
> 
>>
>> RPC/RDMA polls CQ
>> 	Reaps receive completion
>>
>> RDMA Send arrives
>> 	PCIe RO Write for receive data
>> 	PCIe RO write for receive descriptor
>> 	Does *not* interrupt, since CQ not armed
>>
>> RPC/RDMA continues to poll CQ
>> 	Reaps receive completion
>> 	PCIe RO writes not yet flushed
>> 	Processes incomplete in-memory data
>> 	Bzzzt
>>
>> Hopefully, the adapter performs a PCIe flushing read, or something
>> to avoid this when an interrupt is not generated. Alternatively, I'm
>> overly paranoid.
>>
>> Tom.
>>
>>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>>> semantics, and while it's great to relax that, changing it by default
>>>> for all consumers is something to consider very cautiously.
>>>>
>>>>> Still, obviously people should test on the platforms they have.
>>>>
>>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>>> Speedups will mean nothing if the data is ever damaged.
>>> I agree that data integrity comes first.
>>> Since I currently don't have facilities to test RO in my lab, the
>>> community will have to agree on a set of tests and expected results
>>> that specifically exercise the corner cases you are concerned about.
>>> --
>>> Chuck Lever
>
David Laight April 10, 2021, 1:30 p.m. UTC | #20
From: Tom Talpey
> Sent: 09 April 2021 18:49
> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> >
> >
> >> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
> >>
> >> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
> >>>>
> >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> >>>>>
> >>>>>> We need to get a better idea what correctness testing has been done,
> >>>>>> and whether positive correctness testing results can be replicated
> >>>>>> on a variety of platforms.
> >>>>> RO has been rolling out slowly on mlx5 over a few years and storage
> >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> >>>>> turned on for a long time, userspace HPC applications have been using
> >>>>> it for a while now too.
> >>>>
> >>>> I'd love to see RO be used more, it was always something the RDMA
> >>>> specs supported and carefully architected for. My only concern is
> >>>> that it's difficult to get right, especially when the platforms
> >>>> have been running strictly-ordered for so long. The ULPs need
> >>>> testing, and a lot of it.
> >>>>
> >>>>> We know there are platforms with broken RO implementations (like
> >>>>> Haswell) but the kernel is supposed to globally turn off RO on all
> >>>>> those cases. I'd be a bit surprised if we discover any more from this
> >>>>> series.
> >>>>> On the other hand there are platforms that get huge speed ups from
> >>>>> turning this on, AMD is one example, there are a bunch in the ARM
> >>>>> world too.
> >>>>
> >>>> My belief is that the biggest risk is from situations where completions
> >>>> are batched, and therefore polling is used to detect them without
> >>>> interrupts (which explicitly). The RO pipeline will completely reorder
> >>>> DMA writes, and consumers which infer ordering from memory contents may
> >>>> break. This can even apply within the provider code, which may attempt
> >>>> to poll WR and CQ structures, and be tripped up.
> >>> You are referring specifically to RPC/RDMA depending on Receive
> >>> completions to guarantee that previous RDMA Writes have been
> >>> retired? Or is there a particular implementation practice in
> >>> the Linux RPC/RDMA code that worries you?
> >>
> >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> >> is hopefully unfounded, is that the RO pipeline might not have flushed
> >> when a completion is posted *after* posting an interrupt.
> >>
> >> Something like this...
> >>
> >> RDMA Write arrives
> >> 	PCIe RO Write for data
> >> 	PCIe RO Write for data
> >> 	...
> >> RDMA Write arrives
> >> 	PCIe RO Write for data
> >> 	...
> >> RDMA Send arrives
> >> 	PCIe RO Write for receive data
> >> 	PCIe RO Write for receive descriptor
> >
> > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
> it will shure prior written RO date has global visibility when the CQE can be observed.
> 
> I wasn't aware that a strongly-ordered PCIe Write will ensure that
> prior relaxed-ordered writes went first. If that's the case, I'm
> fine with it - as long as the providers are correctly coded!!

I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.

I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?

It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!

For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.

Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).

So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.

(The furthest I've got into PCIe internals was fixing the bug
in some vendor-supplied FPGA logic that failed to correctly
handle multiple data TLP responses to a single read TLP.
Fortunately it wasn't in the hard-IP bit.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Max Gurtovoy April 11, 2021, 10:09 a.m. UTC | #21
On 4/6/2021 2:53 PM, Jason Gunthorpe wrote:
> On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
>> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
>>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>>
>>>>  From Avihai,
>>>>
>>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>>> imposed on PCI transactions, and thus, can improve performance.
>>>>
>>>> Until now, relaxed ordering could be set only by user space applications
>>>> for user MRs. The following patch series enables relaxed ordering for the
>>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>>> such, it is ignored by vendors that don't support it.
>>>>
>>>> The following test results show the performance improvement achieved
>>> Did you test this patchset with CPU does not support relaxed ordering?
>> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
>> and they are not interesting from performance point of view.
>>
>>> We observed significantly performance degradation when run perftest with
>>> relaxed ordering enabled over old CPU.
>>>
>>> https://github.com/linux-rdma/perftest/issues/116
>> The perftest is slightly different, but you pointed to the valid point.
>> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
>> and arguably this was needed to be done in perftest too.
> No, the PCI device should not have the RO bit set in this situation.
> It is something mlx5_core needs to do. We can't push this into
> applications.

pcie_relaxed_ordering_enabled is called in 
drivers/net/ethernet/mellanox/mlx5/core/en_common.c so probably need to 
move it to

mlx5_core in this series.



>
> There should be no performance difference from asking for
> IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
> not asking for it at all.
>
> Either the platform has working relaxed ordering that gives a
> performance gain and the RO config spec bit should be set, or it
> doesn't and the bit should be clear.

is this the case today ?

>
> This is not something to decide in userspace, or in RDMA. At worst it
> becomes another platform specific PCI tunable people have to set.
>
> I thought the old haswell systems were quirked to disable RO globally
> anyhow?
>
> Jason
Haakon Bugge April 12, 2021, 6:32 p.m. UTC | #22
> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote:
> 
> From: Tom Talpey
>> Sent: 09 April 2021 18:49
>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>> 
>>> 
>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>>> 
>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>> 
>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>> on a variety of platforms.
>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>> it for a while now too.
>>>>>> 
>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>> specs supported and carefully architected for. My only concern is
>>>>>> that it's difficult to get right, especially when the platforms
>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>> testing, and a lot of it.
>>>>>> 
>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>> series.
>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>> world too.
>>>>>> 
>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>> are batched, and therefore polling is used to detect them without
>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>> completions to guarantee that previous RDMA Writes have been
>>>>> retired? Or is there a particular implementation practice in
>>>>> the Linux RPC/RDMA code that worries you?
>>>> 
>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>> when a completion is posted *after* posting an interrupt.
>>>> 
>>>> Something like this...
>>>> 
>>>> RDMA Write arrives
>>>> 	PCIe RO Write for data
>>>> 	PCIe RO Write for data
>>>> 	...
>>>> RDMA Write arrives
>>>> 	PCIe RO Write for data
>>>> 	...
>>>> RDMA Send arrives
>>>> 	PCIe RO Write for receive data
>>>> 	PCIe RO Write for receive descriptor
>>> 
>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>> it will shure prior written RO date has global visibility when the CQE can be observed.
>> 
>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>> prior relaxed-ordered writes went first. If that's the case, I'm
>> fine with it - as long as the providers are correctly coded!!

The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):

	A Posted Request must not pass another Posted Request unless A2b applies.

	A2b: A Posted Request with RO Set is permitted to pass another Posted Request.


Thxs, Håkon

> 
> I remember trying to read the relevant section of the PCIe spec.
> (Possibly in a book that was trying to make it easier to understand!)
> It is about as clear as mud.
> 
> I presume this is all about allowing PCIe targets (eg ethernet cards)
> to use relaxed ordering on write requests to host memory.
> And that such writes can be completed out of order?
> 
> It isn't entirely clear that you aren't talking of letting the
> cpu do 'relaxed order' writes to PCIe targets!
> 
> For a typical ethernet driver the receive interrupt just means
> 'go and look at the receive descriptor ring'.
> So there is an absolute requirement that the writes for data
> buffer complete before the write to the receive descriptor.
> There is no requirement for the interrupt (requested after the
> descriptor write) to have been seen by the cpu.
> 
> Quite often the driver will find the 'receive complete'
> descriptor when processing frames from an earlier interrupt
> (and nothing to do in response to the interrupt itself).
> 
> So the write to the receive descriptor would have to have RO clear
> to ensure that all the buffer writes complete first.
> 
> (The furthest I've got into PCIe internals was fixing the bug
> in some vendor-supplied FPGA logic that failed to correctly
> handle multiple data TLP responses to a single read TLP.
> Fortunately it wasn't in the hard-IP bit.)
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Tom Talpey April 12, 2021, 8:20 p.m. UTC | #23
On 4/12/2021 2:32 PM, Haakon Bugge wrote:
> 
> 
>> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote:
>>
>> From: Tom Talpey
>>> Sent: 09 April 2021 18:49
>>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>>>
>>>>
>>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>>
>>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>>>
>>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>>> on a variety of platforms.
>>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>>> it for a while now too.
>>>>>>>
>>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>>> specs supported and carefully architected for. My only concern is
>>>>>>> that it's difficult to get right, especially when the platforms
>>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>>> testing, and a lot of it.
>>>>>>>
>>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>>> series.
>>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>>> world too.
>>>>>>>
>>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>>> are batched, and therefore polling is used to detect them without
>>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>>> completions to guarantee that previous RDMA Writes have been
>>>>>> retired? Or is there a particular implementation practice in
>>>>>> the Linux RPC/RDMA code that worries you?
>>>>>
>>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>>> when a completion is posted *after* posting an interrupt.
>>>>>
>>>>> Something like this...
>>>>>
>>>>> RDMA Write arrives
>>>>> 	PCIe RO Write for data
>>>>> 	PCIe RO Write for data
>>>>> 	...
>>>>> RDMA Write arrives
>>>>> 	PCIe RO Write for data
>>>>> 	...
>>>>> RDMA Send arrives
>>>>> 	PCIe RO Write for receive data
>>>>> 	PCIe RO Write for receive descriptor
>>>>
>>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>>> it will shure prior written RO date has global visibility when the CQE can be observed.
>>>
>>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>>> prior relaxed-ordered writes went first. If that's the case, I'm
>>> fine with it - as long as the providers are correctly coded!!
> 
> The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):
> 
> 	A Posted Request must not pass another Posted Request unless A2b applies.
> 
> 	A2b: A Posted Request with RO Set is permitted to pass another Posted Request.
> 
> 
> Thxs, Håkon

Ok, good - a non-RO write (for example, to a CQE), or an interrupt
(which would be similarly non-RO), will "get behind" all prior writes.

So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.

Honestly, the Haswell sounds like a great first candidate, because
if it has a known-broken RO behavior, verifying that it works with
this change is highly important. I'd have greater confidence in newer
platforms, in other words. They *all* have to work, proveably.

Tom.

>> I remember trying to read the relevant section of the PCIe spec.
>> (Possibly in a book that was trying to make it easier to understand!)
>> It is about as clear as mud.
>>
>> I presume this is all about allowing PCIe targets (eg ethernet cards)
>> to use relaxed ordering on write requests to host memory.
>> And that such writes can be completed out of order?
>>
>> It isn't entirely clear that you aren't talking of letting the
>> cpu do 'relaxed order' writes to PCIe targets!
>>
>> For a typical ethernet driver the receive interrupt just means
>> 'go and look at the receive descriptor ring'.
>> So there is an absolute requirement that the writes for data
>> buffer complete before the write to the receive descriptor.
>> There is no requirement for the interrupt (requested after the
>> descriptor write) to have been seen by the cpu.
>>
>> Quite often the driver will find the 'receive complete'
>> descriptor when processing frames from an earlier interrupt
>> (and nothing to do in response to the interrupt itself).
>>
>> So the write to the receive descriptor would have to have RO clear
>> to ensure that all the buffer writes complete first.
>>
>> (The furthest I've got into PCIe internals was fixing the bug
>> in some vendor-supplied FPGA logic that failed to correctly
>> handle multiple data TLP responses to a single read TLP.
>> Fortunately it wasn't in the hard-IP bit.)
>>
>> 	David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>
Jason Gunthorpe April 12, 2021, 10:48 p.m. UTC | #24
On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:

> So the issue is only in testing all the providers and platforms,
> to be sure this new behavior isn't tickling anything that went
> unnoticed all along, because no RDMA provider ever issued RO.

The mlx5 ethernet driver has run in RO mode for a long time, and it
operates in basically the same way as RDMA. The issues with Haswell
have been worked out there already.

The only open question is if the ULPs have errors in their
implementation, which I don't think we can find out until we apply
this series and people start running their tests aggressively.

Jason
Tom Talpey April 14, 2021, 2:16 p.m. UTC | #25
On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> 
>> So the issue is only in testing all the providers and platforms,
>> to be sure this new behavior isn't tickling anything that went
>> unnoticed all along, because no RDMA provider ever issued RO.
> 
> The mlx5 ethernet driver has run in RO mode for a long time, and it
> operates in basically the same way as RDMA. The issues with Haswell
> have been worked out there already.
> 
> The only open question is if the ULPs have errors in their
> implementation, which I don't think we can find out until we apply
> this series and people start running their tests aggressively.

I agree that the core RO support should go in. But turning it on
by default for a ULP should be the decision of each ULP maintainer.
It's a huge risk to shift all the storage drivers overnight. How
do you propose to ensure the aggressive testing happens?

One thing that worries me is the patch02 on-by-default for the dma_lkey.
There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
from being set in __ib_alloc_pd().

Tom.
David Laight April 14, 2021, 2:41 p.m. UTC | #26
From: Tom Talpey
> Sent: 14 April 2021 15:16
> 
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> >
> >> So the issue is only in testing all the providers and platforms,
> >> to be sure this new behavior isn't tickling anything that went
> >> unnoticed all along, because no RDMA provider ever issued RO.
> >
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> >
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
> 
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?
> 
> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().

What is a ULP in this context?

I've presumed that this is all about getting PCIe targets (ie cards)
to set the RO (relaxed ordering) bit in some of the write TLP they
generate for writing to host memory?

So whatever driver initialises the target needs to configure whatever
target-specific register enables the RO transfers themselves.

After that there could be flags in the PCIe config space of the target
and any bridges that clear the RO flag.

There could also be flags in the bridges and root complex to ignore
the RO flag even if it is set.

Then the Linux kernel can have option(s) to tell the driver not
to enable RO - even though the driver believes it should all work.
This could be a single global flag, or fin-grained in some way.

So what exactly is this patch series doing?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jason Gunthorpe April 14, 2021, 2:44 p.m. UTC | #27
On Wed, Apr 14, 2021 at 10:16:28AM -0400, Tom Talpey wrote:
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> > 
> > > So the issue is only in testing all the providers and platforms,
> > > to be sure this new behavior isn't tickling anything that went
> > > unnoticed all along, because no RDMA provider ever issued RO.
> > 
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> > 
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
> 
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?

Realistically we do test most of the RDMA storage ULPs at NVIDIA over
mlx5 which is the only HW that will enable this for now.

I disagree it is a "huge risk".

Additional wider testing is welcomed and can happen over the 16 week
release cycle for a kernel. I would aim to get the relaxed ordering
changed merged to linux-next a week or so after the merge window.

Further testing happens before these changes would get picked up in a
distro on something like MLNX_OFED.

I don't think we need to make the patch design worse or over think the
submission process for something that, so far, hasn't discovered any
issues and alread has a proven track record in other ULPs.

Any storage ULP that has a problem here is mis-using verbs and the DMA
API and thus has an existing data-corruption bug that they are simply
lucky to have not yet discovered.

> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().

The ULPs are being forced into relaxed_ordering. They don't get to
turn it off one by one. The v2 will be more explicit about this as
there will be no ULP patches, just the verbs core code being updated.

Jason
Jason Gunthorpe April 14, 2021, 2:49 p.m. UTC | #28
On Wed, Apr 14, 2021 at 02:41:52PM +0000, David Laight wrote:

> So whatever driver initialises the target needs to configure whatever
> target-specific register enables the RO transfers themselves.

RDMA in general, and mlx5 in particular, is a layered design:

mlx5_core <- owns the PCI function, should turn on RO at the PCI
             function level
 mlx5_en  <- Commands the chip to use RO for queues used in ethernet
ib_core
  ib_uverbs
    mlx5_ib <- Commands the chip to use RO for some queues used in
               userspace
  ib_srp* <- A ULP driver built on RDMA - this patch commands the chip
             to use RO on SRP queues
  nvme-rdma <- Ditto
  ib_iser <- Ditto
  rds <- Ditto

So this series is about expanding the set of queues running on mlx5
that have RO turned when the PCI function is already running with RO
enabled.

We want as many queues as possible RO enabled because it brings big
performance wins

Jason