mbox series

[rdma-next,v1,0/2] Enable relaxed ordering for ULPs

Message ID cover.1621505111.git.leonro@nvidia.com (mailing list archive)
Headers show
Series Enable relaxed ordering for ULPs | expand

Message

Leon Romanovsky May 20, 2021, 10:13 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v1:
 * Enabled by default RO in IB/core instead of changing all users
v0: https://lore.kernel.org/lkml/20210405052404.213889-1-leon@kernel.org

------------------------------------------------------------------------
From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance for
applications that can handle this lack of strict ordering.

Currently, relaxed ordering can be set only by user space applications
for user MRs. Not all user space applications support relaxed ordering
and for this reason it was added as an optional capability that is
disabled by default. This behavior is not changed as part of this series,
and relaxed ordering remains disabled by default for user space.

On the other hand, kernel users should universally support relaxed
ordering, as they are designed to read data only after observing the CQE
and use the DMA API correctly. There are a few platforms with broken
relaxed ordering implementation, but for them relaxed ordering is expected
to be turned off globally in the PCI level. In addition, note that this is
not the first use of relaxed ordering. Relaxed ordering has been enabled
by default in mlx5 ethernet driver, and user space apps use it as well for
quite a while.

Hence, this series enabled relaxed ordering by default for kernel users so
they can benefit as well from the performance improvements.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed by running FIO traffic
between a NVIDIA DGX A100 (ConnectX-6 NICs and AMD CPUs) and a NVMe
storage fabric, using NFSoRDMA:

Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec

With relaxed ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec

The series has been tested over NVMe, iSER, SRP and NFS with ConnectX-6
NIC. The tests included FIO verify and stress tests, and various
resiliency tests (shutting down NIC port in the middle of traffic,
rebooting the target in the middle of traffic etc.).

Thanks

Avihai Horon (2):
  RDMA: Enable Relaxed Ordering by default for kernel ULPs
  RDMA/mlx5: Allow modifying Relaxed Ordering via fast registration

 drivers/infiniband/core/umem.c                |  2 +-
 drivers/infiniband/core/uverbs_cmd.c          | 64 ++++++++++++++--
 drivers/infiniband/core/uverbs_std_types_mr.c | 21 ++++--
 drivers/infiniband/hw/mlx5/devx.c             | 10 ++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          | 10 ++-
 drivers/infiniband/hw/mlx5/mr.c               | 22 +++---
 drivers/infiniband/hw/mlx5/wr.c               | 74 ++++++++++++++-----
 include/rdma/ib_verbs.h                       | 68 ++++++++++++++++-
 8 files changed, 220 insertions(+), 51 deletions(-)

Comments

Jason Gunthorpe May 26, 2021, 7:30 p.m. UTC | #1
On Thu, May 20, 2021 at 01:13:34PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Changelog:
> v1:
>  * Enabled by default RO in IB/core instead of changing all users
> v0: https://lore.kernel.org/lkml/20210405052404.213889-1-leon@kernel.org
> 
> >From Avihai,
> 
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance for
> applications that can handle this lack of strict ordering.
> 
> Currently, relaxed ordering can be set only by user space applications
> for user MRs. Not all user space applications support relaxed ordering
> and for this reason it was added as an optional capability that is
> disabled by default. This behavior is not changed as part of this series,
> and relaxed ordering remains disabled by default for user space.
> 
> On the other hand, kernel users should universally support relaxed
> ordering, as they are designed to read data only after observing the CQE
> and use the DMA API correctly. There are a few platforms with broken
> relaxed ordering implementation, but for them relaxed ordering is expected
> to be turned off globally in the PCI level. In addition, note that this is
> not the first use of relaxed ordering. Relaxed ordering has been enabled
> by default in mlx5 ethernet driver, and user space apps use it as well for
> quite a while.
> 
> Hence, this series enabled relaxed ordering by default for kernel users so
> they can benefit as well from the performance improvements.
> 
> The following test results show the performance improvement achieved
> with relaxed ordering. The test was performed by running FIO traffic
> between a NVIDIA DGX A100 (ConnectX-6 NICs and AMD CPUs) and a NVMe
> storage fabric, using NFSoRDMA:
> 
> Without Relaxed Ordering:
> READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
> io=1987GiB (2133GB), run=120422-120422msec
> 
> With relaxed ordering:
> READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
> io=2367GiB (2542GB), run=32492-32492msec
> 
> The series has been tested over NVMe, iSER, SRP and NFS with ConnectX-6
> NIC. The tests included FIO verify and stress tests, and various
> resiliency tests (shutting down NIC port in the middle of traffic,
> rebooting the target in the middle of traffic etc.).

There was such a big discussion on the last version I wondered why
this was so quiet. I guess because the cc list isn't very big..

Adding the people from the original thread, here is the patches:

https://lore.kernel.org/linux-rdma/cover.1621505111.git.leonro@nvidia.com/

I think this is the general approach that was asked for, to special case
uverbs and turn it on in kernel universally
 
Jason
David Laight May 27, 2021, 8:11 a.m. UTC | #2
From: Jason Gunthorpe
> Sent: 26 May 2021 20:30
> 
> On Thu, May 20, 2021 at 01:13:34PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Changelog:
> > v1:
> >  * Enabled by default RO in IB/core instead of changing all users
> > v0: https://lore.kernel.org/lkml/20210405052404.213889-1-leon@kernel.org
> >
> > >From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance for
> > applications that can handle this lack of strict ordering.
> >
> > Currently, relaxed ordering can be set only by user space applications
> > for user MRs. Not all user space applications support relaxed ordering
> > and for this reason it was added as an optional capability that is
> > disabled by default. This behavior is not changed as part of this series,
> > and relaxed ordering remains disabled by default for user space.
> >
> > On the other hand, kernel users should universally support relaxed
> > ordering, as they are designed to read data only after observing the CQE
> > and use the DMA API correctly. There are a few platforms with broken
> > relaxed ordering implementation, but for them relaxed ordering is expected
> > to be turned off globally in the PCI level. In addition, note that this is
> > not the first use of relaxed ordering. Relaxed ordering has been enabled
> > by default in mlx5 ethernet driver, and user space apps use it as well for
> > quite a while.
> >
> > Hence, this series enabled relaxed ordering by default for kernel users so
> > they can benefit as well from the performance improvements.
> >
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed by running FIO traffic
> > between a NVIDIA DGX A100 (ConnectX-6 NICs and AMD CPUs) and a NVMe
> > storage fabric, using NFSoRDMA:
> >
> > Without Relaxed Ordering:
> > READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
> > io=1987GiB (2133GB), run=120422-120422msec
> >
> > With relaxed ordering:
> > READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
> > io=2367GiB (2542GB), run=32492-32492msec
> >
> > The series has been tested over NVMe, iSER, SRP and NFS with ConnectX-6
> > NIC. The tests included FIO verify and stress tests, and various
> > resiliency tests (shutting down NIC port in the middle of traffic,
> > rebooting the target in the middle of traffic etc.).
> 
> There was such a big discussion on the last version I wondered why
> this was so quiet. I guess because the cc list isn't very big..
> 
> Adding the people from the original thread, here is the patches:
> 
> https://lore.kernel.org/linux-rdma/cover.1621505111.git.leonro@nvidia.com/
> 
> I think this is the general approach that was asked for, to special case
> uverbs and turn it on in kernel universally

I'm still not sure which PCIe transactions you are enabling relaxed
ordering for.
Nothing has ever said that in layman's terms.

IIRC PCIe targets (like ethernet chips) can use relaxed ordered
writes for frame contents but must use strongly ordered writes
for the corresponding ring (control structure) updates.

If the kernel is issuing relaxed ordered writes then the same
conditions would need to be satisfied.
CPU barrier instructions are unlikely to affect what happens
once a (posted) write is issued - but the re-ordering happens
at the PCIe bridges or actual targets.
So barrier instructions are unlikely to help.

I'm not sure what a 'MR' is in this context.
But if you are changing the something that is part of the
virtual to physical address mapping then such a global
change is actually likely to break anything that cares.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jason Gunthorpe May 31, 2021, 6:13 p.m. UTC | #3
On Thu, May 27, 2021 at 08:11:14AM +0000, David Laight wrote:
> > There was such a big discussion on the last version I wondered why
> > this was so quiet. I guess because the cc list isn't very big..
> > 
> > Adding the people from the original thread, here is the patches:
> > 
> > https://lore.kernel.org/linux-rdma/cover.1621505111.git.leonro@nvidia.com/
> > 
> > I think this is the general approach that was asked for, to special case
> > uverbs and turn it on in kernel universally
> 
> I'm still not sure which PCIe transactions you are enabling relaxed
> ordering for.  Nothing has ever said that in layman's terms.
>
> IIRC PCIe targets (like ethernet chips) can use relaxed ordered
> writes for frame contents but must use strongly ordered writes
> for the corresponding ring (control structure) updates.

Right, it is exactly like this, just not expressed in ethernet
specific terms.

Data transfer TLPs are relaxed ordered and control structure TLPs are
normal ordered.

Jason
David Laight May 31, 2021, 9:45 p.m. UTC | #4
From: Jason Gunthorpe
> Sent: 31 May 2021 19:14
> 
> On Thu, May 27, 2021 at 08:11:14AM +0000, David Laight wrote:
> > > There was such a big discussion on the last version I wondered why
> > > this was so quiet. I guess because the cc list isn't very big..
> > >
> > > Adding the people from the original thread, here is the patches:
> > >
> > > https://lore.kernel.org/linux-rdma/cover.1621505111.git.leonro@nvidia.com/
> > >
> > > I think this is the general approach that was asked for, to special case
> > > uverbs and turn it on in kernel universally
> >
> > I'm still not sure which PCIe transactions you are enabling relaxed
> > ordering for.  Nothing has ever said that in layman's terms.
> >
> > IIRC PCIe targets (like ethernet chips) can use relaxed ordered
> > writes for frame contents but must use strongly ordered writes
> > for the corresponding ring (control structure) updates.
> 
> Right, it is exactly like this, just not expressed in ethernet
> specific terms.
> 
> Data transfer TLPs are relaxed ordered and control structure TLPs are
> normal ordered.

So exactly what is this patch doing?

'Enabling relaxed ordering' sounds like something that is setting
the 'relaxed ordering' bit in TLP.
Doing that in any global fashion is clearly broken.

OTOH if it is (effectively) stopping the clearing of the 'relaxed ordering'
bit by one of the PCIe bridges (or the root complex) then it is rather
different.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jason Gunthorpe May 31, 2021, 10:44 p.m. UTC | #5
On Mon, May 31, 2021 at 09:45:47PM +0000, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 31 May 2021 19:14
> > 
> > On Thu, May 27, 2021 at 08:11:14AM +0000, David Laight wrote:
> > > > There was such a big discussion on the last version I wondered why
> > > > this was so quiet. I guess because the cc list isn't very big..
> > > >
> > > > Adding the people from the original thread, here is the patches:
> > > >
> > > > https://lore.kernel.org/linux-rdma/cover.1621505111.git.leonro@nvidia.com/
> > > >
> > > > I think this is the general approach that was asked for, to special case
> > > > uverbs and turn it on in kernel universally
> > >
> > > I'm still not sure which PCIe transactions you are enabling relaxed
> > > ordering for.  Nothing has ever said that in layman's terms.
> > >
> > > IIRC PCIe targets (like ethernet chips) can use relaxed ordered
> > > writes for frame contents but must use strongly ordered writes
> > > for the corresponding ring (control structure) updates.
> > 
> > Right, it is exactly like this, just not expressed in ethernet
> > specific terms.
> > 
> > Data transfer TLPs are relaxed ordered and control structure TLPs are
> > normal ordered.
> 
> So exactly what is this patch doing?

It allows RDMA devices to set the relaxed ordering bit in their PCI
TLPs following the rules outlined above, but specified in detail, in
the InfiniBand Architecture spec.

Today the Linux model prevents devices from using the PCI relaxed
ordering bit in their TLPs at all.

Jason