[v1,00/24] Shared PD and MR

Message ID 20190821142125.5706-1-yuval.shaia@oracle.com (mailing list archive)

Message

Yuval Shaia Aug. 21, 2019, 2:21 p.m. UTC
The following patch set introduces the shared object feature.

The shared object feature allows one process to create HW objects
(currently PD and MR) that a second process can then import.

The patch set is logically split into 4 parts, as follows:
- patches 1 to 7 and 18 are preparation steps.
- patches 8 to 14 are the implementation of import PD
- patches 15 to 17 are the implementation of the verb
- patches 19 to 24 are the implementation of import MR

v0 -> v1:
	* Delete the patch "IB/uverbs: ufile must be freed only when not
	  used anymore". The process can die; the ucontext remains until
	  the last reference to it is closed.
	* Rebase to latest for-next branch

Shamir Rabinovitch (16):
  RDMA/uverbs: uobj_get_obj_read should return the ib_uobject
  RDMA/uverbs: Delete the macro uobj_put_obj_read
  RDMA/nldev: ib_pd can be pointed by multiple ib_ucontext
  IB/{core,hw}: ib_pd should not have ib_uobject pointer
  IB/core: ib_uobject need HW object reference count
  IB/uverbs: Helper function to initialize ufile member of
    uverbs_attr_bundle
  IB/uverbs: Add context import lock/unlock helper
  IB/verbs: Prototype of HW object clone callback
  IB/mlx4: Add implementation of clone_pd callback
  IB/mlx5: Add implementation of clone_pd callback
  RDMA/rxe: Add implementation of clone_pd callback
  IB/uverbs: Add clone reference counting to ib_pd
  IB/uverbs: Add PD import verb
  IB/mlx4: Enable import from FD verb
  IB/mlx5: Enable import from FD verb
  RDMA/rxe: Enable import from FD verb

Yuval Shaia (8):
  IB/core: Install clone ib_pd in device ops
  IB/core: ib_mr should not have ib_uobject pointer
  IB/core: Install clone ib_mr in device ops
  IB/mlx4: Add implementation of clone_pd callback
  IB/mlx5: Add implementation of clone_pd callback
  RDMA/rxe: Add implementation of clone_pd callback
  IB/uverbs: Add clone reference counting to ib_mr
  IB/uverbs: Add MR import verb

 drivers/infiniband/core/device.c              |   2 +
 drivers/infiniband/core/nldev.c               | 127 ++++-
 drivers/infiniband/core/rdma_core.c           |  23 +-
 drivers/infiniband/core/uverbs.h              |   2 +
 drivers/infiniband/core/uverbs_cmd.c          | 489 +++++++++++++++---
 drivers/infiniband/core/uverbs_main.c         |   1 +
 drivers/infiniband/core/uverbs_std_types_mr.c |   1 -
 drivers/infiniband/core/verbs.c               |   4 -
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c    |   1 -
 drivers/infiniband/hw/mlx4/main.c             |  18 +-
 drivers/infiniband/hw/mlx5/main.c             |  34 +-
 drivers/infiniband/hw/mthca/mthca_qp.c        |   3 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c         |   5 +
 include/rdma/ib_verbs.h                       |  43 +-
 include/rdma/uverbs_std_types.h               |  11 +-
 include/uapi/rdma/ib_user_verbs.h             |  15 +
 include/uapi/rdma/rdma_netlink.h              |   3 +
 17 files changed, 669 insertions(+), 113 deletions(-)

Comments

Jason Gunthorpe Aug. 21, 2019, 2:50 p.m. UTC | #1
On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> Following patch-set introduce the shared object feature.
> 
> A shared object feature allows one process to create HW objects (currently
> PD and MR) so that a second process can import.
> 
> Patch-set is logically splits to 4 parts as the following:
> - patches 1 to 7 and 18 are preparation steps.
> - patches 8 to 14 are the implementation of import PD
> - patches 15 to 17 are the implementation of the verb
> - patches 19 to 24 are the implementation of import MR

This is way too big. 10-14 patches at most in a series.

Jason
Ira Weiny Aug. 21, 2019, 11:37 p.m. UTC | #2
On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> Following patch-set introduce the shared object feature.
> 
> A shared object feature allows one process to create HW objects (currently
> PD and MR) so that a second process can import.

For something this fundamental I think the cover letter should be more
detailed than this.  Questions I have without digging into the code:

What is the use case?

What is the "key" that allows a MR to be shared among 2 processes?  Do you
introduce some PD identifier?  And then some {PDID, lkey} tuple is used to ID
the MR?

I assume you have to share the PD first and then any MR in the shared PD can be
shared?  If so how does the MR get shared?

Again I'm concerned with how this will interact with the RDMA and file system
interaction we have been trying to fix.

Why is SCM_RIGHTS on the rdma context FD not sufficient to share the entire
context, PD, and all MR's?

Ira

Yuval Shaia Aug. 22, 2019, 8:41 a.m. UTC | #3
On Wed, Aug 21, 2019 at 04:37:37PM -0700, Ira Weiny wrote:
> On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> > Following patch-set introduce the shared object feature.
> > 
> > A shared object feature allows one process to create HW objects (currently
> > PD and MR) so that a second process can import.

Hi Ira,

> 
> For something this fundamental I think the cover letter should be more
> detailed than this.  Questions I have without digging into the code:
> 
> What is the use case?

I have only one use case, but I didn't add it to the commit log so as not
to limit the usage of this feature. You are right, though: the cover letter
is a great place for such things, so I will add it for v2.

Anyway, here is our use case: consider a server with a huge amount of
memory, where hundreds or even thousands of processes use it to serve
client requests. In this case the HCA has to manage hundreds or thousands
of MRs. A better design might be for one process to create one (or
several) MR(s) that are shared with the other processes. This would
dramatically reduce the number of address translation entries and cache
misses.
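
As a rough illustration of the scaling argument above (a back-of-the-envelope
sketch, not a measurement): assume every process registers the same region
and the HCA keeps one translation entry per page per MR. The function name,
the 64 GiB region, the 4 KiB page size and the process count are all
hypothetical, and real HCAs may use larger pages or other layouts.

```python
def translation_entries(region_bytes: int, page_bytes: int, num_mrs: int) -> int:
    """Entries needed if each MR maps the whole region, one entry per page."""
    pages = -(-region_bytes // page_bytes)  # ceiling division
    return pages * num_mrs


REGION = 64 * 2**30   # hypothetical 64 GiB of registered memory
PAGE = 4096           # hypothetical 4 KiB page size

per_process = translation_entries(REGION, PAGE, num_mrs=1000)  # one MR per process
shared = translation_entries(REGION, PAGE, num_mrs=1)          # one imported MR
print(per_process // shared)  # how many times more entries the per-process design needs
```

Under these assumptions the entry count grows linearly with the number of
processes, which is the overhead an imported MR is meant to avoid.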

> 
> What is the "key" that allows a MR to be shared among 2 processes?  Do you
> introduce some PD identifier?  And then some {PDID, lkey} tuple is used to ID
> the MR?
> 
> I assume you have to share the PD first and then any MR in the shared PD can be
> shared?  If so how does the MR get shared?

Sorry, I'm not following.
I think the term 'share' is somewhat of a misnomer; a process actually
'imports' objects into its context. And yes, the workflow is to first
import the PD and then import the MR.

> 
> Again I'm concerned with how this will interact with the RDMA and file system
> interaction we have been trying to fix.

I'm not aware of this file-system issue. Can you point me to some
discussion of it so I can see how this patch set affects it?

> 
> Why is SCM_RIGHTS on the rdma context FD not sufficient to share the entire
> context, PD, and all MR's?

Well, SCM_RIGHTS is great: one process can share the IB context with
another. But it is not enough, because:
- What API can the second process use to get its hands on one of the PDs
  or MRs from this context?
- What mechanism takes care of the destruction of such objects (SCM_RIGHTS
  takes care of the ref counting of the context, but I'm referring to the
  PD and MR objects)?

The entire purpose of this patch set is to address these two questions.
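
For readers unfamiliar with the mechanism, here is a minimal sketch of
SCM_RIGHTS passing using Python's standard socket module. A pipe stands in
for the RDMA context FD (an assumption for illustration; no RDMA device is
involved), and the helper names are hypothetical. It shows that the receiver
gets its own reference to the open file, which is all SCM_RIGHTS provides:
it gives no way to name individual objects inside the context.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary message.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo() -> bytes:
    """Pass the write end of a pipe, close the sender's copy, keep using it."""
    a, b = socket.socketpair()
    r, w = os.pipe()          # stand-in for the context FD (assumption)
    send_fd(a, w)
    w2 = recv_fd(b)           # a new descriptor for the same open file
    os.close(w)               # the sender drops its reference
    os.write(w2, b"still open")
    os.close(w2)
    data = os.read(r, 64)
    os.close(r)
    a.close()
    b.close()
    return data
```

The received descriptor keeps the underlying file alive after the sender
closes its copy, which matches the point that SCM_RIGHTS handles the
reference counting of the context only.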

Yuval

Yuval Shaia Aug. 22, 2019, 8:50 a.m. UTC | #4
On Wed, Aug 21, 2019 at 02:50:16PM +0000, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> > Following patch-set introduce the shared object feature.
> > 
> > A shared object feature allows one process to create HW objects (currently
> > PD and MR) so that a second process can import.
> > 
> > Patch-set is logically splits to 4 parts as the following:
> > - patches 1 to 7 and 18 are preparation steps.
> > - patches 8 to 14 are the implementation of import PD
> > - patches 15 to 17 are the implementation of the verb
> > - patches 19 to 24 are the implementation of import MR
> 
> This is way too big. 10-14 patches at most in a series.

I agree with you.
Actually, I had an offline discussion with Shamir about that.
Shamir's viewpoint is that he wanted to split things into smaller pieces
to ease maintenance (git bisect etc.) and code review.

So we have two options now. One is to split this patch set into two
separate patch sets, one dealing with preparation (infrastructure and
cleanups) and the other with the actual feature. The second option is to
merge some patches, e.g. the patches that install the hook in the
providers' code could be merged.

So as not to break up Shamir's work, I tend to go with the first option.

Shamir, what do you think?

Yuval

> 
> Jason
Doug Ledford Aug. 22, 2019, 2:15 p.m. UTC | #5
On Thu, 2019-08-22 at 11:41 +0300, Yuval Shaia wrote:
> On Wed, Aug 21, 2019 at 04:37:37PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> > > Following patch-set introduce the shared object feature.
> > > 
> > > A shared object feature allows one process to create HW objects
> > > (currently
> > > PD and MR) so that a second process can import.
> 
> Hi Ira,
> 
> > For something this fundamental I think the cover letter should be
> > more
> > detailed than this.  Questions I have without digging into the code:
> > 
> > What is the use case?
> 
> I have only one use case but i didn't added it to commit log just not
> to
> limit the usage of this feature but you are right, cover letter is
> great
> for such things, will add it for v2.
> 
> Anyway, here is our use case: Consider a case of server with huge
> amount
> of memory and some hundreds or even thousands processes are using it
> to
> serves clients requests. In this case the HCA will have to manage
> hundreds
> or thousands MRs. A better design maybe would be that one process will
> create one (or several) MR(s) which will be shared with the other
> processes. This will reduce the number of address translation entries
> and
> cache miss dramatically.

Unless I'm misreading you here, it will be at the expense of pretty much
all inter-process memory security.  You're talking about one process
creating some large MRs just to cover the overall memory in the machine,
then sharing that among processes, and all using it to reduce the MR
workload of the card.  This sounds like going back to the days of MS-DOS.
It also sounds like a programming error in one process could potentially
expose the data buffers of all processes sharing this PD and MR.

I get the idea, and the problem you are trying to solve, but I'm not
sure that going down this path is wise.

Maybe... maybe if you limit a queue pair to send/recv only and no
rdma_{read,write}, then this wouldn't be quite as bad.  But even then
I'm still very leery of this "feature".

> 
> > What is the "key" that allows a MR to be shared among 2
> > processes?  Do you
> > introduce some PD identifier?  And then some {PDID, lkey} tuple is
> > used to ID
> > the MR?
> > 
> > I assume you have to share the PD first and then any MR in the
> > shared PD can be
> > shared?  If so how does the MR get shared?
> 
> Sorry, i'm not following.
> I think the term 'share' is somehow mistake, it is actually a process
> 'imports' objects into it's context. And yes, the workflow is first to
> import the PD and then import the MR.
> 
> > Again I'm concerned with how this will interact with the RDMA and
> > file system
> > interaction we have been trying to fix.
> 
> I'm not aware of this file-system thing, can you point me to some
> discussion on that so i'll see how this patch-set affect it.
> 
> > Why is SCM_RIGHTS on the rdma context FD not sufficient to share the
> > entire
> > context, PD, and all MR's?
> 
> Well, this SCM_RIGHTS is great, one can share the IB context with
> another.
> But it is not enough, because:
> - What API the second process can use to get his hands on one of the
> PDs or
>   MRs from this context?
> - What mechanism takes care of the destruction of such objects
> (SCM_RIGHTS
>   takes care for the ref counting of the context but i'm referring to
> the
>   PDs and MRs objects)?
> 
> The entire purpose of this patch set is to address these two
> questions.
> 
> Yuval
Ira Weiny Aug. 22, 2019, 4:58 p.m. UTC | #6
On Thu, Aug 22, 2019 at 11:41:03AM +0300, Yuval Shaia wrote:
> On Wed, Aug 21, 2019 at 04:37:37PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> > > Following patch-set introduce the shared object feature.
> > > 
> > > A shared object feature allows one process to create HW objects (currently
> > > PD and MR) so that a second process can import.
> 
> Hi Ira,
> 
> > 
> > For something this fundamental I think the cover letter should be more
> > detailed than this.  Questions I have without digging into the code:
> > 
> > What is the use case?
> 
> I have only one use case but i didn't added it to commit log just not to
> limit the usage of this feature but you are right, cover letter is great
> for such things, will add it for v2.
> 
> Anyway, here is our use case: Consider a case of server with huge amount
> of memory and some hundreds or even thousands processes are using it to
> serves clients requests. In this case the HCA will have to manage hundreds
> or thousands MRs. A better design maybe would be that one process will
> create one (or several) MR(s) which will be shared with the other
> processes. This will reduce the number of address translation entries and
> cache miss dramatically.

I think Doug covered my concerns in this area.

> 
> > 
> > What is the "key" that allows a MR to be shared among 2 processes?  Do you
> > introduce some PD identifier?  And then some {PDID, lkey} tuple is used to ID
> > the MR?
> > 
> > I assume you have to share the PD first and then any MR in the shared PD can be
> > shared?  If so how does the MR get shared?
> 
> Sorry, i'm not following.
> I think the term 'share' is somehow mistake, it is actually a process
> 'imports' objects into it's context. And yes, the workflow is first to
> import the PD and then import the MR.

Ok fair enough but the title of the thread is "Sharing PD and MR" so I used the
term Share.  And I expect that a random process can't import objects unless the
owning process allows it to, right?

I mean you can't just have any process grab a PD and MR and start using it.  So
I assume there is some "sharing" by the originating process.

> 
> > 
> > Again I'm concerned with how this will interact with the RDMA and file system
> > interaction we have been trying to fix.
> 
> I'm not aware of this file-system thing, can you point me to some
> discussion on that so i'll see how this patch-set affect it.


https://lkml.org/lkml/2019/6/6/1101
https://lkml.org/lkml/2019/8/9/1043
https://lwn.net/Articles/796000/

There are many more articles, patch sets, discussion threads...  This work has
been going on much longer than I have been working on it.

> 
> > 
> > Why is SCM_RIGHTS on the rdma context FD not sufficient to share the entire
> > context, PD, and all MR's?
> 
> Well, this SCM_RIGHTS is great, one can share the IB context with another.
> But it is not enough, because:
> - What API the second process can use to get his hands on one of the PDs or
>   MRs from this context?

MRs can be passed by {PD,key} through any number of mechanisms.  All you need
is an ID for them.  Maybe this is clear in the code.  If so sorry about that.

> - What mechanism takes care of the destruction of such objects (SCM_RIGHTS
>   takes care for the ref counting of the context but i'm referring to the
>   PDs and MRs objects)?

This is inherent in the lifetime of the uverbs file object, to which the cloned
FDs (one in each process) hold a reference.

Add to your list "how does destruction of a MR in 1 process get communicated to
the other?"  Does the 2nd process just get failed WR's?

> 
> The entire purpose of this patch set is to address these two questions.

Fair enough but the cover letter should spell out the above and how this series
fixes that problem.

I have some of the same concerns as Doug WRT memory sharing.  FWIW I'm not sure
that what SCM_RIGHTS is doing is safe or correct.

For that work I'm really starting to think SCM_RIGHTS transfers should be
blocked.  It just seems wrong that Process B gets references to Process A's
mm_struct and holds the memory Process A allocated.  This seems wrong for any
type of memory, file system or not.  That said I'm assuming that this is all
within a single user so admins can at least determine who is pinning down all
this memory.

Ira
Jason Gunthorpe Aug. 22, 2019, 5:03 p.m. UTC | #7
On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:

> Add to your list "how does destruction of a MR in 1 process get communicated to
> the other?"  Does the 2nd process just get failed WR's?

IMHO an object that has been shared can no longer be asynchronously
destroyed. That is the whole point. A lkey/rkey # alone is inherently
unsafe without also holding a refcount on the MR.
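
The refcount point can be sketched with a toy model. The class and method
names below are hypothetical and are not taken from the patch set; the only
thing modelled is that an imported object is released when the last holder
drops its reference, not when the creator does.

```python
class SharedMR:
    """Toy MR whose underlying resource is released on the last put()."""

    def __init__(self, lkey: int) -> None:
        self.lkey = lkey
        self.refcount = 1        # the creator's reference
        self.released = False    # True once the HW resource would be freed

    def import_ref(self) -> "SharedMR":
        """Take an extra reference on behalf of an importing process."""
        if self.released:
            raise RuntimeError("MR already released")
        self.refcount += 1
        return self

    def put(self) -> None:
        """Drop one reference; release the resource when none remain."""
        if self.refcount == 0:
            raise RuntimeError("refcount underflow")
        self.refcount -= 1
        if self.refcount == 0:
            self.released = True


mr = SharedMR(lkey=0x1234)   # process A creates the MR
imported = mr.import_ref()   # process B imports it and holds a reference
mr.put()                     # A "destroys" its handle
still_valid = not mr.released
imported.put()               # B drops the last reference
```

Passing only the lkey number to the second process would skip the
`import_ref()` step, so nothing would stop the creator's `put()` from
releasing the MR while the key is still in use.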

> I have some of the same concerns as Doug WRT memory sharing.  FWIW I'm not sure
> that what SCM_RIGHTS is doing is safe or correct.
> 
> For that work I'm really starting to think SCM_RIGHTS transfers should be
> blocked.  

That isn't possible. SCM_RIGHTS is just one special case; fork(),
exec(), etc. all cause the same situation. Any solution that blocks
those is a total non-starter.

> It just seems wrong that Process B gets references to Process A's
> mm_struct and holds the memory Process A allocated.  

Except for ODP, a MR doesn't reference the mm_struct. It references
the pages. It is not unlike a memfd.

Jason
Ira Weiny Aug. 22, 2019, 8:10 p.m. UTC | #8
> On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> 
> > Add to your list "how does destruction of a MR in 1 process get
> > communicated to the other?"  Does the 2nd process just get failed WR's?
> 
> IHMO a object that has been shared can no longer be asynchronously destroyed.
> That is the whole point. A lkey/rkey # alone is inherently unsafe without also
> holding a refcount on the MR.
> 
> > I have some of the same concerns as Doug WRT memory sharing.  FWIW I'm
> > not sure that what SCM_RIGHTS is doing is safe or correct.
> >
> > For that work I'm really starting to think SCM_RIGHTS transfers should
> > be blocked.
> 
> That isn't possible, SCM_RIGHTS is just some special case, fork(), exec(), etc all
> cause the same situation. Any solution that blocks those is a total non-starter.

Right, except in the case of fork(), exec() all of the file system references which may be pinned also get copied.  With SCM_RIGHTS they may not...  And my concern here is, if I understand this mechanism, it would introduce another avenue where the file pin is shared _without_ the file lease (or with a potentially zombie'ed lease).[1]

[1] https://lkml.org/lkml/2019/8/16/994

> 
> > It just seems wrong that Process B gets references to Process A's
> > mm_struct and holds the memory Process A allocated.
> 
> Except for ODP, a MR doesn't reference the mm_struct. It references the pages.
> It is not unlike a memfd.

I'm thinking of the owner_mm...  It is not like it is holding the entire process address space I know that.  But it is holding onto memory which Process A allocated.

Ira

> 
> Jason
Jason Gunthorpe Aug. 23, 2019, 11:57 a.m. UTC | #9
On Thu, Aug 22, 2019 at 08:10:09PM +0000, Weiny, Ira wrote:
> > On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> > 
> > > Add to your list "how does destruction of a MR in 1 process get
> > > communicated to the other?"  Does the 2nd process just get failed WR's?
> > 
> > IHMO a object that has been shared can no longer be asynchronously destroyed.
> > That is the whole point. A lkey/rkey # alone is inherently unsafe without also
> > holding a refcount on the MR.
> > 
> > > I have some of the same concerns as Doug WRT memory sharing.  FWIW I'm
> > > not sure that what SCM_RIGHTS is doing is safe or correct.
> > >
> > > For that work I'm really starting to think SCM_RIGHTS transfers should
> > > be blocked.
> > 
> > That isn't possible, SCM_RIGHTS is just some special case, fork(), exec(), etc all
> > cause the same situation. Any solution that blocks those is a total non-starter.
> 
> Right, except in the case of fork(), exec() all of the file system
> references which may be pinned also get copied.  

And what happens when one child of the fork closes the reference, or
exec with CLOEXEC causes it not to be inherited?

It completely breaks the unix model to tie something to a process rather
than to an FD.
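
A small sketch of the FD semantics in question, using only Python's os and
fcntl modules (the function names are hypothetical, and a pipe stands in for
the RDMA context FD). After fork() both processes hold references to the
same open file, so a close in the child does not affect the parent, and the
close-on-exec flag is a per-descriptor property that decides whether a
descriptor survives exec().

```python
import fcntl
import os


def child_close_leaves_parent_open() -> bytes:
    """Fork, let the child close its copies, then use the pipe in the parent."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: drop its inherited references and exit without cleanup.
        os.close(r)
        os.close(w)
        os._exit(0)
    os.waitpid(pid, 0)
    os.write(w, b"parent ok")   # the parent's reference is still valid
    os.close(w)
    data = os.read(r, 64)
    os.close(r)
    return data


def is_cloexec(fd: int) -> bool:
    """Report whether fd would be closed automatically across exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

Note that Python creates descriptors with close-on-exec set by default;
`os.set_inheritable(fd, True)` clears the flag.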

> > Except for ODP, a MR doesn't reference the mm_struct. It references the pages.
> > It is not unlike a memfd.
> 
> I'm thinking of the owner_mm...  It is not like it is holding the
> entire process address space I know that.  But it is holding onto
> memory which Process A allocated.

It only holds the mm for some statistics accounting; it is really just
holding pages outside the mm.

Jason
Ira Weiny Aug. 23, 2019, 9:33 p.m. UTC | #10
> Subject: Re: [PATCH v1 00/24] Shared PD and MR
> 
> On Thu, Aug 22, 2019 at 08:10:09PM +0000, Weiny, Ira wrote:
> > > On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> > >
> > > > > Add to your list "how does destruction of a MR in 1 process get
> > > > > communicated to the other?"  Does the 2nd process just get failed
> > > > > WR's?
> > > >
> > > > IHMO a object that has been shared can no longer be asynchronously
> > > > destroyed.
> > > > That is the whole point. A lkey/rkey # alone is inherently unsafe
> > > > without also holding a refcount on the MR.
> > > >
> > > > > I have some of the same concerns as Doug WRT memory sharing.
> > > > > FWIW
> > > > > I'm not sure that what SCM_RIGHTS is doing is safe or correct.
> > > > >
> > > > > For that work I'm really starting to think SCM_RIGHTS transfers
> > > > > should be blocked.
> > > >
> > > > That isn't possible, SCM_RIGHTS is just some special case, fork(),
> > > > exec(), etc all cause the same situation. Any solution that blocks
> > > > those is a total non-starter.
> >
> > Right, except in the case of fork(), exec() all of the file system
> > references which may be pinned also get copied.
> 
> And what happens one one child of the fork closes the reference, or exec with
> CLOEXEC causes it to no inherent?

Dave Chinner is suggesting the close will hang.  Exec with CLOEXEC would probably not because the RDMA close would release the pin allowing the close of the data file to finish...  At least as far as my testing has shown.

> 
> It completely breaks the unix model to tie something to a process not to a
> FD.

But that is just it.  Dave is advocating that the FD's must get transferred.  It has nothing to do with a process.

I'm somewhat confused at this point because in this thread I was advocating that the RDMA context FD is what needs to get "shared" between the processes.  Is that what you are advocating as well?  Does this patch set do that?

> 
> > > > Except for ODP, a MR doesn't reference the mm_struct. It references the
> > > > pages.
> > > > It is not unlike a memfd.
> >
> > I'm thinking of the owner_mm...  It is not like it is holding the
> > entire process address space I know that.  But it is holding onto
> > memory which Process A allocated.
> 
> It only hold the mm for some statistics accounting, it is really just holding
> pages outside the mm.

But those pages aren't necessarily mapped in Process B.  And if they are mapped in Process A then you are sending data to Process A, not "B"...  That is one twisted way to look at it anyway...

Ira
Yuval Shaia Aug. 26, 2019, 9:35 a.m. UTC | #11
On Thu, Aug 22, 2019 at 10:15:11AM -0400, Doug Ledford wrote:
> On Thu, 2019-08-22 at 11:41 +0300, Yuval Shaia wrote:
> > On Wed, Aug 21, 2019 at 04:37:37PM -0700, Ira Weiny wrote:
> > > On Wed, Aug 21, 2019 at 05:21:01PM +0300, Yuval Shaia wrote:
> > > > Following patch-set introduce the shared object feature.
> > > > 
> > > > A shared object feature allows one process to create HW objects
> > > > (currently PD and MR) so that a second process can import.
> > 
> > Hi Ira,
> > 
> > > For something this fundamental I think the cover letter should be more
> > > detailed than this.  Questions I have without digging into the code:
> > > 
> > > What is the use case?
> > 
> > I have only one use case but i didn't added it to commit log just not to
> > limit the usage of this feature but you are right, cover letter is great
> > for such things, will add it for v2.
> > 
> > Anyway, here is our use case: Consider a case of server with huge amount
> > of memory and some hundreds or even thousands processes are using it to
> > serves clients requests. In this case the HCA will have to manage hundreds
> > or thousands MRs. A better design maybe would be that one process will
> > create one (or several) MR(s) which will be shared with the other
> > processes. This will reduce the number of address translation entries and
> > cache miss dramatically.
> 
> Unless I'm misreading you here, it will be at the expense of pretty much
> all inter-process memory security.  You're talking about one process

Isn't it already there with the use of Linux shared memory?

> creating some large MRs just to cover the overall memory in the machine,
> then sharing that among processes, and all using it to reduce the MR
> workload of the card.  This sounds like going back to the days of MSDos.

Well, too many MRs can lead to a serious bottleneck. We are currently dealing
with such an issue when many VMs try to re-register their MRs at once, but
since it is out of the scope of $subject I will not expand on it. I am just
mentioning it because *it is* an issue, and reducing the number of MRs could
help.

> It also sounds like a programming error in one process could expose
> potentially all processes data buffers across all processes sharing this
> PD and MR.

Again, this is already possible with shared memory, and some designs rely
on that.

> 
> I get the idea, and the problem you are trying to solve, but I'm not
> sure that going down this path is wise.
> 
> Maybe....maybe if you limit a queue pair to send/recv only and no
> rdma_{read,write}, then this wouldn't be quite as bad.  But even then
> I'm still very leary of this "feature".

How about if all the processes are considered one unit of trust? Anyway,
the same exposure already exists in a multi-threaded application or when one
process forks child processes.
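
To make the shared-memory comparison concrete, here is a minimal sketch of the
fork() case, assuming plain POSIX/Linux calls and no RDMA at all. The function
name shared_buffer_demo() is made up for illustration. A child writes into a
MAP_SHARED region and the parent observes the write, so the two processes
already have to trust each other with that buffer:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous shared region, fork, let the
 * child write into it, then copy what the parent sees into out.
 * Returns 0 on success, -1 on error.
 */
int shared_buffer_demo(char *out, size_t out_len)
{
	const char msg[] = "written by child";
	const size_t len = 4096;
	char *buf;
	pid_t pid;
	int status;

	assert(out != NULL);
	if (out_len < sizeof(msg))
		return -1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: nothing stops it from touching any byte of the region. */
		memcpy(buf, msg, sizeof(msg));
		_exit(0);
	}

	/* Parent: wait for the child, then read what it wrote. */
	if (waitpid(pid, &status, 0) != pid ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		munmap(buf, len);
		return -1;
	}

	memcpy(out, buf, sizeof(msg));
	munmap(buf, len);
	return 0;
}
```

This only illustrates the trust argument; it says nothing about how an MR
covering such a region would behave once imported by another process.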

> 
> > 
> > > What is the "key" that allows a MR to be shared among 2 processes?  Do you
> > > introduce some PD identifier?  And then some {PDID, lkey} tuple is used to ID
> > > the MR?
> > > 
> > > I assume you have to share the PD first and then any MR in the shared PD can be
> > > shared?  If so how does the MR get shared?
> > 
> > Sorry, i'm not following.
> > I think the term 'share' is somehow mistake, it is actually a process
> > 'imports' objects into it's context. And yes, the workflow is first to
> > import the PD and then import the MR.
> > 
> > > Again I'm concerned with how this will interact with the RDMA and file system
> > > interaction we have been trying to fix.
> > 
> > I'm not aware of this file-system thing, can you point me to some
> > discussion on that so i'll see how this patch-set affect it.
> > 
> > > Why is SCM_RIGHTS on the rdma context FD not sufficient to share the entire
> > > context, PD, and all MR's?
> > 
> > Well, this SCM_RIGHTS is great, one can share the IB context with another.
> > But it is not enough, because:
> > - What API the second process can use to get his hands on one of the PDs or
> >   MRs from this context?
> > - What mechanism takes care of the destruction of such objects (SCM_RIGHTS
> >   takes care for the ref counting of the context but i'm referring to the
> >   PDs and MRs objects)?
> > 
> > The entire purpose of this patch set is to address these two questions.
> > 
> > Yuval
> > 
> > > Ira
> > > 
> 
> -- 
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
Yuval Shaia Aug. 26, 2019, 9:51 a.m. UTC | #12
> > 
> > > 
> > > What is the "key" that allows a MR to be shared among 2 processes?  Do you
> > > introduce some PD identifier?  And then some {PDID, lkey} tuple is used to ID
> > > the MR?
> > > 
> > > I assume you have to share the PD first and then any MR in the shared PD can be
> > > shared?  If so how does the MR get shared?
> > 
> > Sorry, i'm not following.
> > I think the term 'share' is somehow mistake, it is actually a process
> > 'imports' objects into it's context. And yes, the workflow is first to
> > import the PD and then import the MR.
> 
> Ok fair enough but the title of the thread is "Sharing PD and MR" so I used the

You are right, my bad, will change the cover letter and some commit
messages accordingly.

> term Share.  And I expect that any random process can't import objects to which
> the owning process does not allow them to right?
> 
> I mean you can't just have any process grab a PD and MR and start using it.  So
> I assume there is some "sharing" by the originating process.

Any process that connects to the socket that the SCM_RIGHTS message is
relayed on. I guess that if this mechanism exists then importing the actual
objects is just a supplemental service.
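
For readers unfamiliar with the mechanism, a minimal sketch of an SCM_RIGHTS
transfer over an AF_UNIX socket follows. It uses only standard socket calls;
the helper names send_fd() and recv_fd() are made up for illustration, and in
the scenario discussed here the descriptor being passed would be the RDMA
context FD:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open descriptor over a connected AF_UNIX socket. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of payload for the control message to ride on */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns the new FD or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	/* The kernel has installed a new descriptor referring to the same open file. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor for the same open file, which is
why the context's lifetime is covered by the file reference count while the
objects inside it are not addressed by this mechanism alone.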

> 
> >
Yuval Shaia Aug. 26, 2019, 10:04 a.m. UTC | #13
> 
> > - What mechanism takes care of the destruction of such objects (SCM_RIGHTS
> >   takes care for the ref counting of the context but i'm referring to the
> >   PDs and MRs objects)?
> 
> This is inherent in the lifetime of the uverbs file object to which cloned FDs
> (one in each process) have a reference to.
> 
> Add to your list "how does destruction of a MR in 1 process get communicated to
> the other?"  Does the 2nd process just get failed WR's?

I meant the opposite, i.e. when two processes are sharing an object, the
fact that one decides to destroy it must not affect the other. So a ref
count needs to be maintained, and the object is disposed of only once every
holder of a reference has asked for destruction.
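
A minimal userspace sketch of that rule, assuming a made-up struct shared_obj
and C11 atomics. This is not the code from the patch set (kernel code would
normally use kref or refcount_t); it only shows that a "destroy" from one
holder is a no-op until the last reference is dropped:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared HW object such as a PD or MR. */
struct shared_obj {
	atomic_int refs;	/* one reference per context holding the object */
	bool *destroyed;	/* set when the object is really torn down (demo only) */
};

/* The creating context holds the first reference. */
struct shared_obj *shared_obj_create(bool *destroyed)
{
	struct shared_obj *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	atomic_init(&obj->refs, 1);
	obj->destroyed = destroyed;
	*destroyed = false;
	return obj;
}

/* "Import": a second context takes its own reference. */
void shared_obj_import(struct shared_obj *obj)
{
	atomic_fetch_add(&obj->refs, 1);
}

/*
 * "Destroy" as requested by one context. Returns true only when this was
 * the last reference and the object was actually torn down.
 */
bool shared_obj_destroy(struct shared_obj *obj)
{
	int old = atomic_fetch_sub(&obj->refs, 1);

	assert(old > 0);
	if (old != 1)
		return false;	/* other holders remain, keep the object alive */

	*obj->destroyed = true;
	free(obj);
	return true;
}
```

With two holders, the first shared_obj_destroy() call returns false and leaves
the object usable by the other holder; only the second call tears it down.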

>
Yuval Shaia Aug. 26, 2019, 10:10 a.m. UTC | #14
> 
> > 
> > > 
> > > Why is SCM_RIGHTS on the rdma context FD not sufficient to share the entire
> > > context, PD, and all MR's?
> > 
> > Well, this SCM_RIGHTS is great, one can share the IB context with another.
> > But it is not enough, because:
> > - What API the second process can use to get his hands on one of the PDs or
> >   MRs from this context?
> 
> MRs can be passed by {PD,key} through any number of mechanisms.  All you need
> is an ID for them.  Maybe this is clear in the code.  If so sorry about that.

So given an ID of a PD, what is the function it can use to get the pointer
to the ibv_pd object?

>
Yuval Shaia Aug. 26, 2019, 10:29 a.m. UTC | #15
On Thu, Aug 22, 2019 at 05:03:15PM +0000, Jason Gunthorpe wrote:
> On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> 
> > Add to your list "how does destruction of a MR in 1 process get communicated to
> > the other?"  Does the 2nd process just get failed WR's?
> 
> IHMO a object that has been shared can no longer be asynchronously
> destroyed. That is the whole point. A lkey/rkey # alone is inherently
> unsafe without also holding a refcount on the MR.

You meant to say "can no longer be synchronously destroyed", right?

> 
> > I have some of the same concerns as Doug WRT memory sharing.  FWIW I'm not sure
> > that what SCM_RIGHTS is doing is safe or correct.
> > 
> > For that work I'm really starting to think SCM_RIGHTS transfers should be
> > blocked.  
> 
> That isn't possible, SCM_RIGHTS is just some special case, fork(),
> exec(), etc all cause the same situation. Any solution that blocks
> those is a total non-starter.
> 
> > It just seems wrong that Process B gets references to Process A's
> > mm_struct and holds the memory Process A allocated.  
> 
> Except for ODP, a MR doesn't reference the mm_struct. It references
> the pages. It is not unlike a memfd.
> 
> Jason
Yuval Shaia Aug. 26, 2019, 10:58 a.m. UTC | #16
On Fri, Aug 23, 2019 at 09:33:06PM +0000, Weiny, Ira wrote:
> > Subject: Re: [PATCH v1 00/24] Shared PD and MR
> > 
> > On Thu, Aug 22, 2019 at 08:10:09PM +0000, Weiny, Ira wrote:
> > > > On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> > > >
> > > > > Add to your list "how does destruction of a MR in 1 process get
> > > > > communicated to the other?"  Does the 2nd process just get failed
> > > > > WR's?
> > > >
> > > > IHMO a object that has been shared can no longer be asynchronously
> > > > destroyed. That is the whole point. A lkey/rkey # alone is inherently
> > > > unsafe without also holding a refcount on the MR.
> > > >
> > > > > I have some of the same concerns as Doug WRT memory sharing.  FWIW
> > > > > I'm not sure that what SCM_RIGHTS is doing is safe or correct.
> > > > >
> > > > > For that work I'm really starting to think SCM_RIGHTS transfers
> > > > > should be blocked.
> > > >
> > > > That isn't possible, SCM_RIGHTS is just some special case, fork(),
> > > > exec(), etc all cause the same situation. Any solution that blocks
> > > > those is a total non-starter.
> > >
> > > Right, except in the case of fork(), exec() all of the file system
> > > references which may be pinned also get copied.
> > 
> > And what happens one one child of the fork closes the reference, or exec
> > with CLOEXEC causes it to no inherent?
> 
> Dave Chinner is suggesting the close will hang.  Exec with CLOEXEC would probably not because the RDMA close would release the pin allowing the close of the data file to finish...  At least as far as my testing has shown.
> 
> > 
> > It completely breaks the unix model to tie something to a process not to a
> > FD.
> 
> But that is just it.  Dave is advocating that the FD's must get transferred.  It has nothing to do with a process.
> 
> I'm somewhat confused at this point because in this thread I was advocating that the RDMA context FD is what needs to get "shared" between the processes.  Is that what you are advocating as well?  Does this patch set do that?

The IB context sharing mechanism already exists. This patch-set proposes a
way of importing and maintaining the IBV objects of such a shared IB context.

> 
> > 
> > > > Except for ODP, a MR doesn't reference the mm_struct. It references the
> > > > pages. It is not unlike a memfd.
> > >
> > > I'm thinking of the owner_mm...  It is not like it is holding the
> > > entire process address space I know that.  But it is holding onto
> > > memory which Process A allocated.
> > 
> > It only hold the mm for some statistics accounting, it is really just holding
> > pages outside the mm.
> 
> But those pages aren't necessarily mapped in Process B.  and if they are mapped in Process A then you are sending data to Process A not "B"...  That is one twisted way to look at it anyway...
> 
> Ira
>
Jason Gunthorpe Aug. 26, 2019, 12:26 p.m. UTC | #17
On Mon, Aug 26, 2019 at 01:29:27PM +0300, Yuval Shaia wrote:
> On Thu, Aug 22, 2019 at 05:03:15PM +0000, Jason Gunthorpe wrote:
> > On Thu, Aug 22, 2019 at 09:58:42AM -0700, Ira Weiny wrote:
> > 
> > > Add to your list "how does destruction of a MR in 1 process get communicated to
> > > the other?"  Does the 2nd process just get failed WR's?
> > 
> > IHMO a object that has been shared can no longer be asynchronously
> > destroyed. That is the whole point. A lkey/rkey # alone is inherently
> > unsafe without also holding a refcount on the MR.
> 
> You meant to say "can no longer be synchronously destroyed", right?

No, I mean another process cannot just rip the rkey out from a
process that is using it, asynchronously.

Jason