[v2,rdma-next] RDMA/mlx5: Enable Relaxed Ordering by default for kernel ULPs

Message ID b7e820aab7402b8efa63605f4ea465831b3b1e5e.1623236426.git.leonro@nvidia.com (mailing list archive)

Commit Message

Leon Romanovsky June 9, 2021, 11:05 a.m. UTC
From: Avihai Horon <avihaih@nvidia.com>

Relaxed Ordering is a capability that can only benefit users that support
it. All kernel ULPs should support Relaxed Ordering, as they are designed
to read data only after observing the CQE and use the DMA API correctly.

Hence, implicitly enable Relaxed Ordering by default for kernel ULPs.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
Changelog:
v2:
 * Dropped IB/core patch and set RO implicitly in mlx5 exactly like in
   eth side of mlx5 driver.
v1: https://lore.kernel.org/lkml/cover.1621505111.git.leonro@nvidia.com
 * Enabled by default RO in IB/core instead of changing all users
v0: https://lore.kernel.org/lkml/20210405052404.213889-1-leon@kernel.org
---
 drivers/infiniband/hw/mlx5/mr.c | 10 ++++++----
 drivers/infiniband/hw/mlx5/wr.c |  5 ++++-
 2 files changed, 10 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig June 9, 2021, 12:52 p.m. UTC | #1
On Wed, Jun 09, 2021 at 02:05:03PM +0300, Leon Romanovsky wrote:
> From: Avihai Horon <avihaih@nvidia.com>
> 
> Relaxed Ordering is a capability that can only benefit users that support
> it. All kernel ULPs should support Relaxed Ordering, as they are designed
> to read data only after observing the CQE and use the DMA API correctly.
> 
> Hence, implicitly enable Relaxed Ordering by default for kernel ULPs.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> Changelog:
> v2:
>  * Dropped IB/core patch and set RO implicitly in mlx5 exactly like in
>    eth side of mlx5 driver.

This looks great in terms of code changes.  But can we please also add a
patch to document that PCIe relaxed ordering is fine for kernel ULP usage
somewhere?
Leon Romanovsky June 9, 2021, 1:53 p.m. UTC | #2
On Wed, Jun 09, 2021 at 02:52:41PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 09, 2021 at 02:05:03PM +0300, Leon Romanovsky wrote:
> > From: Avihai Horon <avihaih@nvidia.com>
> > 
> > Relaxed Ordering is a capability that can only benefit users that support
> > it. All kernel ULPs should support Relaxed Ordering, as they are designed
> > to read data only after observing the CQE and use the DMA API correctly.
> > 
> > Hence, implicitly enable Relaxed Ordering by default for kernel ULPs.
> > 
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > Changelog:
> > v2:
> >  * Dropped IB/core patch and set RO implicitly in mlx5 exactly like in
> >    eth side of mlx5 driver.
> 
> This looks great in terms of code changes.  But can we please also add a
> patch to document that PCIe relaxed ordering is fine for kernel ULP usage
> somewhere?

Sure, did you have in mind some concrete place? Or will new file in the
Documentation/infiniband/ folder be good enough too?

Thanks
Christoph Hellwig June 9, 2021, 1:59 p.m. UTC | #3
On Wed, Jun 09, 2021 at 04:53:23PM +0300, Leon Romanovsky wrote:
> Sure, did you have in mind some concrete place? Or will new file in the
> Documentation/infiniband/ folder be good enough too?

Maybe add a kerneldoc comment for the map_mr_sg() ib_device_ops method?
David Laight June 9, 2021, 2:10 p.m. UTC | #4
From: Christoph Hellwig
> Sent: 09 June 2021 13:53
> 
> On Wed, Jun 09, 2021 at 02:05:03PM +0300, Leon Romanovsky wrote:
> > From: Avihai Horon <avihaih@nvidia.com>
> >
> > Relaxed Ordering is a capability that can only benefit users that support
> > it. All kernel ULPs should support Relaxed Ordering, as they are designed
> > to read data only after observing the CQE and use the DMA API correctly.
> >
> > Hence, implicitly enable Relaxed Ordering by default for kernel ULPs.
> >
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > Changelog:
> > v2:
> >  * Dropped IB/core patch and set RO implicitly in mlx5 exactly like in
> >    eth side of mlx5 driver.
> 
> This looks great in terms of code changes.  But can we please also add a
> patch to document that PCIe relaxed ordering is fine for kernel ULP usage
> somewhere?

Several things need to happen to use relaxed ordering:
1) The pcie target has to be able to use RO for buffer writes
   and non-RO for control writes.
2) The Linux driver has to enable (1).
3) The generic Linux kernel has to 'not mask' RO on all the PCIe
   bridges and root.
4) The hardware memory system has to 'not break' when passed a RO write.

The comments about the DMA API are almost pointless.
They'd only be relevant if the driver were looking for the last
byte of a buffer to change - and then assumed the rest is valid.

This patch looks like a driver patch - so it is changing (2) above.

I've seen another patch that (I think) is enabling (3).
Although some x86 CPUs are known to be broken (i.e. fail (4)).

And I still don't know what a ULP is.

I actually know a bit about TLPs.
I found (and fixed) a bug in the Altera/Intel fpga logic
where it didn't like receiving two data TLPs in response
to a single read TLP.
We've also (now) got logic in the fpga that traces all received
and sent TLPs to an internal memory block.
So I can see what the cpus we have actually generate.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Chuck Lever III June 9, 2021, 2:37 p.m. UTC | #5
Hi David-

> On Jun 9, 2021, at 10:10 AM, David Laight <David.Laight@ACULAB.COM> wrote:
> 
> And I still don't know what a ULP is.

Upper Layer Protocol.

That's a generic term for an RDMA verbs consumer, like NVMe or
RPC-over-RDMA.


--
Chuck Lever
David Laight June 9, 2021, 3:05 p.m. UTC | #6
From: Chuck Lever III
> Sent: 09 June 2021 15:37
> 
> Hi David-
> 
> > On Jun 9, 2021, at 10:10 AM, David Laight <David.Laight@ACULAB.COM> wrote:
> >
> > And I still don't know what a ULP is.
> 
> Upper Layer Protocol.
> 
> That's a generic term for an RDMA verbs consumer, like NVMe or
> RPC-over-RDMA.

No wonder I didn't spot what it meant.
I'm guessing you have something specific in mind for RDMA as well.

Don't assume that everyone has read all the high-level protocol
specs (and remembers all the TLAs (and ETLAs)) when talking
about very low-level hardware features.

Especially when you are also referring to how the 'relaxed ordering'
bit of a PCIe write TLP is processed.

This all makes your commit message even less meaningful.

In principle some writel() could generate PCIe write TLP (going
to the target) that have the 'relaxed ordering' bit set.
So a ULP that supports relaxed ordering could actually be expected
to generate them - even though there is probably no method
of setting the bit.
Although, in principle, I guess that areas that are 'prefetchable'
(for reads) could be deemed suitable for relaxed writes.
(That way probably lies madness and a load of impossible to fix
timing bugs!)

	David

Jason Gunthorpe June 9, 2021, 3:09 p.m. UTC | #7
On Wed, Jun 09, 2021 at 03:05:52PM +0000, David Laight wrote:

> In principle some writel() could generate PCIe write TLP (going
> to the target) that have the 'relaxed ordering' bit set.

In Linux we call this writel_relaxed(), though I know of no
implementation that sets the RO bit in the TLP based on this, it would
be semantically correct to do so.

writel() has strong order requirements and must not generate a RO TLP.

Jason
David Laight June 9, 2021, 3:48 p.m. UTC | #8
From: Jason Gunthorpe
> Sent: 09 June 2021 16:09
> 
> On Wed, Jun 09, 2021 at 03:05:52PM +0000, David Laight wrote:
> 
> > In principle some writel() could generate PCIe write TLP (going
> > to the target) that have the 'relaxed ordering' bit set.
> 
> In Linux we call this writel_relaxed(), though I know of no
> implementation that sets the RO bit in the TLP based on this, it would
> be semantically correct to do so.
> 
> writel() has strong order requirements and must not generate a RO TLP.

Somehow I'd forgotten about that :-(
It usually just allows the compiler and cpu hardware to re-sequence
the bus cycles.

OTOH I doubt any/many PCIe targets have 'memory' areas that would
benefit from RO write TLPs.
Especially since everything is organised to use target-issued buffer
copies.

I'm guessing that the benefits from RO come when the writes hit memory
that is on a NUMA node or 'differently cached'.
So writes to one cache line can proceed while earlier writes are
still waiting for the cache-coherency protocol.

From what I've seen writel()s aren't too bad - they are async.
The real problem is readl().
The x86 CPUs I have use a separate TLP id (I've forgotten the correct
term) for each cpu core.
So while multiple cpus can (and do) issue concurrent reads, reads from
a single cpu happen one TLP at a time - even though it would be legitimate
for the out-of-order execution unit to issue additional read TLPs.
There are times when you really do have to do PIO buffer reads :-(

	David

Leon Romanovsky June 10, 2021, 7:44 a.m. UTC | #9
On Wed, Jun 09, 2021 at 03:59:24PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 09, 2021 at 04:53:23PM +0300, Leon Romanovsky wrote:
> > Sure, did you have in mind some concrete place? Or will new file in the
> > Documentation/infiniband/ folder be good enough too?
> 
> Maybe add a kerneldoc comment for the map_mr_sg() ib_device_ops method?

I hope that this hunk from the previous cover letter is good enough.

Jason, do you want a v3, or can you fold this into v2?

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9423e70a881c..aaf63a6643d6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2468,6 +2468,13 @@ struct ib_device_ops {
                         enum ib_uverbs_advise_mr_advice advice, u32 flags,
                         struct ib_sge *sg_list, u32 num_sge,
                         struct uverbs_attr_bundle *attrs);
+       /*
+        * Kernel users should universally support relaxed ordering (RO),
+        * as they are designed to read data only after observing the CQE
+        * and use the DMA API correctly.
+        *
+        * Some drivers implicitly enable RO if platform supports it.
+        */
        int (*map_mr_sg)(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
                         unsigned int *sg_offset);
        int (*check_mr_status)(struct ib_mr *mr, u32 check_mask,

Patch

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3363cde85b14..2182e76ae734 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -69,6 +69,7 @@  static void set_mkc_access_pd_addr_fields(void *mkc, int acc, u64 start_addr,
 					  struct ib_pd *pd)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
+	bool ro_pci_enabled = pcie_relaxed_ordering_enabled(dev->mdev->pdev);
 
 	MLX5_SET(mkc, mkc, a, !!(acc & IB_ACCESS_REMOTE_ATOMIC));
 	MLX5_SET(mkc, mkc, rw, !!(acc & IB_ACCESS_REMOTE_WRITE));
@@ -78,10 +79,10 @@  static void set_mkc_access_pd_addr_fields(void *mkc, int acc, u64 start_addr,
 
 	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write))
 		MLX5_SET(mkc, mkc, relaxed_ordering_write,
-			 !!(acc & IB_ACCESS_RELAXED_ORDERING));
+			 acc & IB_ACCESS_RELAXED_ORDERING && ro_pci_enabled);
 	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read))
 		MLX5_SET(mkc, mkc, relaxed_ordering_read,
-			 !!(acc & IB_ACCESS_RELAXED_ORDERING));
+			 acc & IB_ACCESS_RELAXED_ORDERING && ro_pci_enabled);
 
 	MLX5_SET(mkc, mkc, pd, to_mpd(pd)->pdn);
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
@@ -812,7 +813,8 @@  struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc)
 
 	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_PA);
 	MLX5_SET(mkc, mkc, length64, 1);
-	set_mkc_access_pd_addr_fields(mkc, acc, 0, pd);
+	set_mkc_access_pd_addr_fields(mkc, acc | IB_ACCESS_RELAXED_ORDERING, 0,
+				      pd);
 
 	err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
 	if (err)
@@ -2022,7 +2024,7 @@  static void mlx5_set_umr_free_mkey(struct ib_pd *pd, u32 *in, int ndescs,
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 
 	/* This is only used from the kernel, so setting the PD is OK. */
-	set_mkc_access_pd_addr_fields(mkc, 0, 0, pd);
+	set_mkc_access_pd_addr_fields(mkc, IB_ACCESS_RELAXED_ORDERING, 0, pd);
 	MLX5_SET(mkc, mkc, free, 1);
 	MLX5_SET(mkc, mkc, translations_octword_size, ndescs);
 	MLX5_SET(mkc, mkc, access_mode_1_0, access_mode & 0x3);
diff --git a/drivers/infiniband/hw/mlx5/wr.c b/drivers/infiniband/hw/mlx5/wr.c
index 6880627c45be..8841620af82f 100644
--- a/drivers/infiniband/hw/mlx5/wr.c
+++ b/drivers/infiniband/hw/mlx5/wr.c
@@ -866,7 +866,10 @@  static int set_reg_wr(struct mlx5_ib_qp *qp,
 	bool atomic = wr->access & IB_ACCESS_REMOTE_ATOMIC;
 	u8 flags = 0;
 
-	/* Matches access in mlx5_set_umr_free_mkey() */
+	/* Matches access in mlx5_set_umr_free_mkey().
+	 * Relaxed Ordering is set implicitly in mlx5_set_umr_free_mkey() and
+	 * kernel ULPs are not aware of it, so we don't set it here.
+	 */
 	if (!mlx5_ib_can_reconfig_with_umr(dev, 0, wr->access)) {
 		mlx5_ib_warn(
 			to_mdev(qp->ibqp.device),