diff mbox

[RFC,1/3] IB/core: Expose a device attribute for rdma_read access flags

Message ID 1447152255-28231-2-git-send-email-sagig@mellanox.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Sagi Grimberg Nov. 10, 2015, 10:44 a.m. UTC
Different devices may require different access permissions
for rdma reads e.g. for Infiniband devices, local write access
suffices while iWARP devices require remote write permissions as
well.

This situation generates extra logic for ULPs that need to be aware
of the underlying device to determine rdma read access. Instead,
add a device attribute for ULPs to use without being aware of the
underlying device.

Also, set rdma_read_access_flags in the relevant device drivers:
mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
---
 drivers/infiniband/hw/cxgb3/iwch_provider.c  | 2 ++
 drivers/infiniband/hw/cxgb4/provider.c       | 2 ++
 drivers/infiniband/hw/mlx4/main.c            | 1 +
 drivers/infiniband/hw/mlx5/main.c            | 1 +
 drivers/infiniband/hw/mthca/mthca_provider.c | 1 +
 drivers/infiniband/hw/nes/nes_verbs.c        | 2 ++
 drivers/infiniband/hw/ocrdma/ocrdma_main.c   | 1 +
 drivers/infiniband/hw/qib/qib_verbs.c        | 1 +
 drivers/staging/rdma/hfi1/verbs.c            | 2 ++
 include/rdma/ib_verbs.h                      | 1 +
 10 files changed, 14 insertions(+)

Comments

Christoph Hellwig Nov. 10, 2015, 11:25 a.m. UTC | #1
On Tue, Nov 10, 2015 at 12:44:13PM +0200, Sagi Grimberg wrote:
> Different devices may require different access permissions
> for rdma reads e.g. for Infiniband devices, local write access
> suffices while iWARP devices require remote write permissions as
> well.
> 
> This situation generates extra logic for ULPs that need to be aware
> of the underlying device to determine rdma read access. Instead,
> add a device attribute for ULPs to use without being aware of the
> underlying device.
> 
> Also, set rdma_read_access_flags in the relevant device drivers:
> mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
> cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE

From all I can tell nes also is a iWarp driver.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Nov. 10, 2015, 11:36 a.m. UTC | #2
>  From all I can tell nes also is a iWarp driver.

It is.. I don't know why I treated it as IB :)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Nov. 10, 2015, 11:51 a.m. UTC | #3
Hi,

Le mardi 10 novembre 2015 à 12:44 +0200, Sagi Grimberg a écrit :
> Different devices may require different access permissions
> for rdma reads e.g. for Infiniband devices, local write access
> suffices while iWARP devices require remote write permissions as
> well.
> 
> This situation generates extra logic for ULPs that need to be aware
> of the underlying device to determine rdma read access. Instead,
> add a device attribute for ULPs to use without being aware of the
> underlying device.
> 
> Also, set rdma_read_access_flags in the relevant device drivers:
> mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
> cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE
> 

Why were those hw providers not modified to
enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
set it for them ?

Regards.

-- 
Yann Droneaud
OPTEYA

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Nov. 10, 2015, 12:28 p.m. UTC | #4
Hi Yann,

>
> Why were those hw providers not modified to
> enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
> set it for them ?

Do you mean that ULPs will set IB_ACCESS_LOCAL_WRITE and
iWARP providers executing the memory registration will add
IB_ACCESS_REMOTE_WRITE? That's even better I think! Given that
changing the access rights under the user legs is something we'd want
to do. is it?

Given that for IB memory registration is not even needed for rdma_read,
I'd like to move away from code like:

if (rdma_protocol_iwarp())
	register memory for rdma_read
else
	go ahead with ib_sge for rdma_read

I've maid some initial experimentation on trying to abstract this
aspect from ULPs but it's far from being complete and needs more work.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey Nov. 10, 2015, 12:36 p.m. UTC | #5
On 11/10/2015 6:51 AM, Yann Droneaud wrote:
> Le mardi 10 novembre 2015 à 12:44 +0200, Sagi Grimberg a écrit :
>> Also, set rdma_read_access_flags in the relevant device drivers:
>> mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
>> cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE
>>
>
> Why were those hw providers not modified to
> enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
> set it for them ?

Because the Linux verbs don't tell the driver whether the registration
is for an RDMA Read sink buffer.

Sagi, the Windows NDKPI has an NDK_MR_FLAG_RDMA_READ_SINK attribute
which the upper layer can use to convey this information, I've mentioned
it here before.
https://msdn.microsoft.com/en-us/library/windows/hardware/hh439908(v=vs.85).aspx

When this approach is used, the upper layer doesn't have to be aware
at all of the lower layer's details.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Nov. 10, 2015, 12:39 p.m. UTC | #6
On 10/11/2015 14:28, Sagi Grimberg wrote:
> Hi Yann,
>
>>
>> Why were those hw providers not modified to
>> enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
>> set it for them ?
>
> Do you mean that ULPs will set IB_ACCESS_LOCAL_WRITE and
> iWARP providers executing the memory registration will add
> IB_ACCESS_REMOTE_WRITE? That's even better I think! Given that
> changing the access rights under the user legs is something we'd want
> to do. is it?

Oh, forgot that the providers do not know how this key will be used.
Thanks Tom for reminding me.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Nov. 10, 2015, 12:42 p.m. UTC | #7
> Sagi, the Windows NDKPI has an NDK_MR_FLAG_RDMA_READ_SINK attribute
> which the upper layer can use to convey this information, I've mentioned
> it here before.
> https://msdn.microsoft.com/en-us/library/windows/hardware/hh439908(v=vs.85).aspx
>

Thanks for the tip Tom.

>
> When this approach is used, the upper layer doesn't have to be aware
> at all of the lower layer's details.

Yea, we could do that too. Any preferences from other people?
I'm pretty indifferent on which way to go...

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 10, 2015, 1:06 p.m. UTC | #8
On Tue, Nov 10, 2015 at 02:42:44PM +0200, Sagi Grimberg wrote:
> >When this approach is used, the upper layer doesn't have to be aware
> >at all of the lower layer's details.
> 
> Yea, we could do that too. Any preferences from other people?
> I'm pretty indifferent on which way to go...

Yes, that's the way to go.  I though we had agree on doing that as
one of the next step of the MR API cleanups, but decided to postpone
it past the first iteration for some reason that made it non-trivial.

I have to say I still don't like the Windows API either, though as it
still keeps the criptic {local,remote}_{read,write} flags.

Basically what we want is two-dimension selection:

	/*
	 * If %true this is the target of RDMA READ/WRITE operations
	 * from the remote system.  If %false this is the sink / source
	 * for RDMA READ/WRITE operations issued by the local system.
	 */
	enum ib_mr_scope { IB_MR_REMOTE, IB_MR_LOCAL };

	/*
	 * Operation we're registering for.  Can be OR'ed together
	 * when allowing RDMA READs and WRITEs on a single MR.
	 */
	enum ib_mr_dir { IB_MR_RDMA_READ = 1 << 0, IB_MR_RDMA_WRITE = 1 << 1 };

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 10, 2015, 6:01 p.m. UTC | #9
On Tue, Nov 10, 2015 at 12:44:13PM +0200, Sagi Grimberg wrote:
> Different devices may require different access permissions
> for rdma reads e.g. for Infiniband devices, local write access
> suffices while iWARP devices require remote write permissions as
> well.
> 
> This situation generates extra logic for ULPs that need to be aware
> of the underlying device to determine rdma read access. Instead,
> add a device attribute for ULPs to use without being aware of the
> underlying device.
> 
> Also, set rdma_read_access_flags in the relevant device drivers:
> mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
> cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE

IB devices *should* be using a local_dma_lkey instead of a memory
registration for the RDMA READ target memory.

iWarp devices *must* use a memory registration instead.

Having an API that encourages ULPs to do useless MR work on IB does
not seem like a great idea, but if the ULP needs to create a MR anyhow
(SG limits or whatever), then it makes some sense.. But not in absence
of the 'need a mr' general check..

Finally, please don't add rdma_read_access_flags - we went to a lot of
trouble to add the cap stuff, and this stuff is already covered by it.
No need to change every driver.

I'd suggest something like

  unsigned int rdma_cap_rdma_read_mr_flags(const struct ib_pd *pd)
  {
     if (rdma_protocol_iwarp(pd->device, rdma_start_port(pd->device)))
           return IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
     return IB_ACCESS_LOCAL_WRITE;
  }

  bool rdma_cap_need_rdma_read_mr(const struct ib_pd *pd) ...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 10, 2015, 6:06 p.m. UTC | #10
On Tue, Nov 10, 2015 at 12:51:10PM +0100, Yann Droneaud wrote:
> Why were those hw providers not modified to
> enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
> set it for them ?

iWarp hardware simply cannot do it it all.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 11, 2015, 8:08 a.m. UTC | #11
On Tue, Nov 10, 2015 at 11:01:56AM -0700, Jason Gunthorpe wrote:
> No need to change every driver.
> 
> I'd suggest something like
> 
>   unsigned int rdma_cap_rdma_read_mr_flags(const struct ib_pd *pd)
>   {
>      if (rdma_protocol_iwarp(pd->device, rdma_start_port(pd->device)))
>            return IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
>      return IB_ACCESS_LOCAL_WRITE;
>   }

Yes, this looks good!
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Nov. 11, 2015, 9:09 a.m. UTC | #12
On 11/11/2015 10:08, Christoph Hellwig wrote:
> On Tue, Nov 10, 2015 at 11:01:56AM -0700, Jason Gunthorpe wrote:
>> No need to change every driver.
>>
>> I'd suggest something like
>>
>>    unsigned int rdma_cap_rdma_read_mr_flags(const struct ib_pd *pd)
>>    {
>>       if (rdma_protocol_iwarp(pd->device, rdma_start_port(pd->device)))
>>             return IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
>>       return IB_ACCESS_LOCAL_WRITE;
>>    }
>
> Yes, this looks good!

It does, but it doesn't look like something we'd want to check for each
IO...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 11, 2015, 10:26 a.m. UTC | #13
On Wed, Nov 11, 2015 at 11:09:03AM +0200, Sagi Grimberg wrote:
> It does, but it doesn't look like something we'd want to check for each
> IO...

Both potential callers already have a access flags variable in object
that's assigned to at setup time so I don't see a problem here.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index db4c08c53ae3..483ae3c8a13b 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1467,6 +1467,8 @@  int iwch_register_device(struct iwch_dev *dev)
 	dev->ibdev.max_pd = dev->attr.max_pds;
 	dev->ibdev.local_ca_ack_delay = 0;
 	dev->ibdev.max_fast_reg_page_list_len = T3_MAX_FASTREG_DEPTH;
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE |
+					IB_ACCESS_REMOTE_WRITE;
 
 	ret = ib_register_device(&dev->ibdev, NULL);
 	if (ret)
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 84e9b5db0341..001db1cc737c 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -564,6 +564,8 @@  int c4iw_register_device(struct c4iw_dev *dev)
 	dev->ibdev.max_pd = T4_MAX_NUM_PD;
 	dev->ibdev.local_ca_ack_delay = 0;
 	dev->ibdev.max_fast_reg_page_list_len = t4_max_fr_depth(use_dsgl);
+	dev->ibdev.rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE |
+					    IB_ACCESS_REMOTE_WRITE;
 
 	ret = ib_register_device(&dev->ibdev, NULL);
 	if (ret)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 46305dc27f9f..82af524dfb4e 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -524,6 +524,7 @@  static int mlx4_ib_init_device_flags(struct ib_device *ibdev)
 	ibdev->max_map_per_fmr = dev->dev->caps.max_fmr_maps;
 	ibdev->hca_core_clock = dev->dev->caps.hca_core_clock * 1000UL;
 	ibdev->timestamp_mask = 0xFFFFFFFFFFFFULL;
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
 
 	if (!mlx4_is_slave(dev->dev))
 		err = mlx4_get_internal_clock_params(dev->dev, &clock_params);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 5b733221ca5f..b29f79a03494 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -287,6 +287,7 @@  static int mlx5_ib_init_device_flags(struct ib_device *ibdev)
 	ibdev->max_total_mcast_qp_attach = ibdev->max_mcast_qp_attach *
 					   ibdev->max_mcast_grp;
 	ibdev->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	if (MLX5_CAP_GEN(mdev, pg))
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 28d7a8bee3f7..ea1882bbe7a5 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -110,6 +110,7 @@  static int mthca_init_device_flags(struct ib_device *ibdev)
 	ibdev->max_mcast_qp_attach = MTHCA_QP_PER_MGM;
 	ibdev->max_total_mcast_qp_attach = ibdev->max_mcast_qp_attach *
 					   ibdev->max_mcast_grp;
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
 	/*
 	 * If Sinai memory key optimization is being used, then only
 	 * the 8-bit key portion will change.  For other HCAs, the
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 2ef911953e38..adcea173f8dc 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -3868,6 +3868,8 @@  struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
 	nesibdev->ibdev.max_mw = nesibdev->max_mr;
 	nesibdev->ibdev.max_pd = nesibdev->max_pd;
 	nesibdev->ibdev.max_sge_rd = 1;
+	nesibdev->ibdev.rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
+
 	switch (nesdev->nesadapter->max_irrq_wr) {
 		case 0:
 			nesibdev->ibdev.max_qp_rd_atom = 2;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 00ddcf9294e4..235bc90bf2c6 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -244,6 +244,7 @@  static int ocrdma_register_device(struct ocrdma_dev *dev)
 	dev->ibdev.local_ca_ack_delay = dev->attr.local_ca_ack_delay;
 	dev->ibdev.max_fast_reg_page_list_len = dev->attr.max_pages_per_frmr;
 	dev->ibdev.max_pkeys = 1;
+	dev->ibdev.rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
 
 	return ib_register_device(&dev->ibdev, NULL);
 }
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index 9e1af0b9857a..be84307611c4 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -2257,6 +2257,7 @@  int qib_register_ib_device(struct qib_devdata *dd)
 	ibdev->max_mcast_qp_attach = ib_qib_max_mcast_qp_attached;
 	ibdev->max_total_mcast_qp_attach = ibdev->max_mcast_qp_attach *
 		ibdev->max_mcast_grp;
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE;
 
 	snprintf(ibdev->node_desc, sizeof(ibdev->node_desc),
 		 "Intel Infiniband HCA %s", init_utsname()->nodename);
diff --git a/drivers/staging/rdma/hfi1/verbs.c b/drivers/staging/rdma/hfi1/verbs.c
index 664548a62e4d..d3e0dd27a8e2 100644
--- a/drivers/staging/rdma/hfi1/verbs.c
+++ b/drivers/staging/rdma/hfi1/verbs.c
@@ -2062,6 +2062,8 @@  int hfi1_register_ib_device(struct hfi1_devdata *dd)
 	ibdev->max_mcast_qp_attach = hfi1_max_mcast_qp_attached;
 	ibdev->max_total_mcast_qp_attach = ibdev->max_mcast_qp_attach *
 		ibdev->max_mcast_grp;
+	ibdev->rdma_read_access_flags = IB_ACCESS_LOCAL_WRITE |
+					IB_ACCESS_REMOTE_WRITE;
 
 	strncpy(ibdev->node_desc, init_utsname()->nodename,
 		sizeof(ibdev->node_desc));
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 861d92f1a0b0..2abbdd976c18 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1851,6 +1851,7 @@  struct ib_device {
 	struct ib_odp_caps	odp_caps;
 	uint64_t		timestamp_mask;
 	uint64_t		hca_core_clock; /* in KHZ */
+	int			rdma_read_access_flags;
 
 	/**
 	 * The following mandatory functions are used only at device