diff mbox series

[RFC,v2,16/19] RDMA/uverbs: Add back pointer to system file object

Message ID 20190809225833.6657-17-ira.weiny@intel.com (mailing list archive)
State New, archived
Headers show
Series RDMA/FS DAX truncate proposal V1,000,002 ;-) | expand

Commit Message

Ira Weiny Aug. 9, 2019, 10:58 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

In order for MRs to be tracked against the open verbs context the ufile
needs to have a pointer to hand to the GUP code.

No references need to be taken as this should be valid for the lifetime
of the context.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/infiniband/core/uverbs.h      | 1 +
 drivers/infiniband/core/uverbs_main.c | 1 +
 2 files changed, 2 insertions(+)

Comments

Ira Weiny Aug. 12, 2019, 5:28 p.m. UTC | #1
On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In order for MRs to be tracked against the open verbs context the ufile
> > needs to have a pointer to hand to the GUP code.
> > 
> > No references need to be taken as this should be valid for the lifetime
> > of the context.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >  drivers/infiniband/core/uverbs.h      | 1 +
> >  drivers/infiniband/core/uverbs_main.c | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > index 1e5aeb39f774..e802ba8c67d6 100644
> > +++ b/drivers/infiniband/core/uverbs.h
> > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> >  	struct page *disassociate_page;
> >  
> >  	struct xarray		idr;
> > +	struct file             *sys_file; /* backpointer to system file object */
> >  };
> 
> The 'struct file' has a lifetime strictly shorter than the
> ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> pointer like this is confouding as it will be invalid for some of the
> lifetime of the struct.

Ah...  ok.  I really thought it was the other way around.

__fput() should not call ib_uverbs_close() until the last reference on struct
file is released...  What holds references to struct ib_uverbs_file past that?

Perhaps I need to add this (untested)?

diff --git a/drivers/infiniband/core/uverbs_main.c
b/drivers/infiniband/core/uverbs_main.c
index f628f9e4c09f..654e774d9cf2 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1125,6 +1125,8 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
        list_del_init(&file->list);
        mutex_unlock(&file->device->lists_mutex);
 
+       file->sys_file = NULL;
+
        kref_put(&file->ref, ib_uverbs_release_file);
 
        return 0;
Ira Weiny Aug. 12, 2019, 9:15 p.m. UTC | #2
On Mon, Aug 12, 2019 at 02:56:15PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> > On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > In order for MRs to be tracked against the open verbs context the ufile
> > > > needs to have a pointer to hand to the GUP code.
> > > > 
> > > > No references need to be taken as this should be valid for the lifetime
> > > > of the context.
> > > > 
> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > >  drivers/infiniband/core/uverbs.h      | 1 +
> > > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > > >  2 files changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > > +++ b/drivers/infiniband/core/uverbs.h
> > > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > > >  	struct page *disassociate_page;
> > > >  
> > > >  	struct xarray		idr;
> > > > +	struct file             *sys_file; /* backpointer to system file object */
> > > >  };
> > > 
> > > The 'struct file' has a lifetime strictly shorter than the
> > > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > pointer like this is confouding as it will be invalid for some of the
> > > lifetime of the struct.
> > 
> > Ah...  ok.  I really thought it was the other way around.
> > 
> > __fput() should not call ib_uverbs_close() until the last reference on struct
> > file is released...  What holds references to struct ib_uverbs_file past that?
> 
> Child fds hold onto the internal ib_uverbs_file until they are closed

The FDs hold the struct file, don't they?

> 
> > Perhaps I need to add this (untested)?
> > 
> > diff --git a/drivers/infiniband/core/uverbs_main.c
> > b/drivers/infiniband/core/uverbs_main.c
> > index f628f9e4c09f..654e774d9cf2 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -1125,6 +1125,8 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
> >         list_del_init(&file->list);
> >         mutex_unlock(&file->device->lists_mutex);
> >  
> > +       file->sys_file = NULL;
> 
> Now this has unlocked updates to that data.. you'd need some lock and
> get not zero pattern

You can't call "get" here because I'm 99% sure we only get here when struct
file has no references left...  I could be wrong.  It took me a while to work
through the reference counting so I could have missed something.

Ira
Ira Weiny Aug. 13, 2019, 5:41 p.m. UTC | #3
On Tue, Aug 13, 2019 at 08:48:42AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2019 at 02:15:37PM -0700, Ira Weiny wrote:
> > On Mon, Aug 12, 2019 at 02:56:15PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> > > > On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > 
> > > > > > In order for MRs to be tracked against the open verbs context the ufile
> > > > > > needs to have a pointer to hand to the GUP code.
> > > > > > 
> > > > > > No references need to be taken as this should be valid for the lifetime
> > > > > > of the context.
> > > > > > 
> > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > > > >  drivers/infiniband/core/uverbs.h      | 1 +
> > > > > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > > > > >  2 files changed, 2 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > > > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > > > > +++ b/drivers/infiniband/core/uverbs.h
> > > > > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > > > > >  	struct page *disassociate_page;
> > > > > >  
> > > > > >  	struct xarray		idr;
> > > > > > +	struct file             *sys_file; /* backpointer to system file object */
> > > > > >  };
> > > > > 
> > > > > The 'struct file' has a lifetime strictly shorter than the
> > > > > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > > > pointer like this is confouding as it will be invalid for some of the
> > > > > lifetime of the struct.
> > > > 
> > > > Ah...  ok.  I really thought it was the other way around.
> > > > 
> > > > __fput() should not call ib_uverbs_close() until the last reference on struct
> > > > file is released...  What holds references to struct ib_uverbs_file past that?
> > > 
> > > Child fds hold onto the internal ib_uverbs_file until they are closed
> > 
> > The FDs hold the struct file, don't they?
> 
> Only dups, there are other 'child' FDs we can create
> 
> > > Now this has unlocked updates to that data.. you'd need some lock and
> > > get not zero pattern
> > 
> > You can't call "get" here because I'm 99% sure we only get here when struct
> > file has no references left...
> 
> Nope, like I said the other FDs hold the uverbs_file independent of
> the struct file it is related too. 

<sigh>

We don't allow memory registrations to be created with those other FDs...

And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
that some other thread is) destroying all the MR's we have associated with this
FD.

I'll have to think on this more since uverbs_destroy_ufile_hw() does not
block...  Which means there is a window here within the GUP code...  :-/

> 
> This is why having a back pointer like this is so ugly, it creates a
> reference counting cycle

Yep...  I worked through this...  and it was giving me fits...

Anyway, the struct file is the only object in the core which was reasonable to
store this information in since that is what is passed around to other
processes...

Another idea I explored was to create a callback into the driver from the core
which put the responsibility of printing the pin information on the driver.

But that started to be (and is likely going to be) a pretty complicated "dance"
between the core and the drivers so I went this way...

I also thought about holding some other reference on struct file which would
allow release to be called while keeping struct file around.  But that seemed
crazy...

Ira
Ira Weiny Aug. 13, 2019, 8:38 p.m. UTC | #4
On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> 
> > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > that some other thread is) destroying all the MR's we have associated with this
> > FD.
> 
> fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> deletes any underlying HW resources, but the FD persists.

I misspoke.  I should have said associated with this "context".  And of course
uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
struct file which had file_pins hanging off of it would be getting its file
pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
after uverbs_destroy_ufile_hw() is done.

But since it does not block it may be that the struct file is gone before the
MR is actually destroyed.  Which means I think the GUP code would blow up in
that case...  :-(

I was thinking it was the other way around.  And in fact most of the time I
think it is.  But we can't depend on that...

>  
> > > This is why having a back pointer like this is so ugly, it creates a
> > > reference counting cycle
> > 
> > Yep...  I worked through this...  and it was giving me fits...
> > 
> > Anyway, the struct file is the only object in the core which was reasonable to
> > store this information in since that is what is passed around to other
> > processes...
> 
> It could be passed down in the uattr_bundle, once you are in file operations

What is "It"?  The struct file?  Or the file pin information?

> handle the file is guarenteed to exist, and we've now arranged things

I don't understand what you mean by "... once you are in file operations handle... "?

> so the uattr_bundle flows into the umem, as umems can only be
> established under a system call.

"uattr_bundle" == uverbs_attr_bundle right?

The problem is that I don't think the core should be handling
uverbs_attr_bundles directly.  So, I think you are driving at the same idea I
had WRT callbacks into the driver.

The drivers could provide some generic object (in RDMA this could be the
uverbs_attr_bundle) which represents their "context".

The GUP code calls back into the driver with file pin information as it
encounters and pins pages.  The driver, RDMA in this case, associates this
information with the "context".

But for the procfs interface, that context then needs to be associated with any
file which points to it...  For RDMA, or any other "FD based pin mechanism", it
would be up to the driver to "install" a procfs handler into any struct file
which _may_ point to this context.  (before _or_ after memory pins).

Then the procfs code can walk the FD array and if this handler is installed it
knows there is file pin information associated with that struct file and it can
be printed...

This is not impossible.  But I think is a lot harder for drivers to make
right...

Ira
Ira Weiny Aug. 14, 2019, 5:50 p.m. UTC | #5
On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > 
> > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > that some other thread is) destroying all the MR's we have associated with this
> > > > FD.
> > > 
> > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > deletes any underlying HW resources, but the FD persists.
> > 
> > I misspoke.  I should have said associated with this "context".  And of course
> > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > struct file which had file_pins hanging off of it would be getting its file
> > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > after uverbs_destroy_ufile_hw() is done.
> > 
> > But since it does not block it may be that the struct file is gone before the
> > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > that case...  :-(
> 
> Oh, yes, that is true, you also can't rely on the struct file living
> longer than the HW objects either, that isn't how the lifetime model
> works.
> 
> If GUP consumes the struct file it must allow the struct file to be
> deleted before the GUP pin is released.

I may have to think about this a bit.  But I'm starting to lean toward my
callback method as a solution...

> 
> > The drivers could provide some generic object (in RDMA this could be the
> > uverbs_attr_bundle) which represents their "context".
> 
> For RDMA the obvious context is the struct ib_mr *

Not really, but maybe.  See below regarding tracking this across processes.

> 
> > But for the procfs interface, that context then needs to be associated with any
> > file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> > would be up to the driver to "install" a procfs handler into any struct file
> > which _may_ point to this context.  (before _or_ after memory pins).
> 
> Is this all just for debugging? Seems like a lot of complication just
> to print a string

No, this is a requirement to allow an admin to determine why their truncates
may be failing.  As per our discussion here:

https://lkml.org/lkml/2019/6/7/982

Looking back at the thread apparently no one confirmed my question (assertion).
But no one objected to it either!  :-D  From that post:

	"... if we can keep track of who has the pins in lsof can we agree no
	process needs to be SIGKILL'ed?  Admins can do this on their own
	"killing" if they really need to stop the use of these files, right?"

This is what I am trying to do here is ensure that no matter what the user
does.  Fork, munmap, SCM_RIGHTS, close (on any FD), the underlying pin is
associated to any process which has access to those pins and is holding
references to those pages.  Then any user of the system who gets a failing
truncate can figure out which processes are holding this up.

> 
> Generally, I think you'd be better to associate things with the
> mm_struct not some struct file... The whole design is simpler as GUP
> already has the mm_struct.

I wish I _could_ do that...  And for some simple users I do that.  This is why
rdma_pin has the option to track against mm_struct _OR_ struct file.

At first it seemed like carrying over the mm_struct info during fork would
work...  but then there is SCM_RIGHTS where one can share the RDMA context with
any "random" process...  AFAICS struct file has no concept of mm_struct (nor
should it) so the dup for SCM_RIGHTS processing would not be able to do this.
A further complication was that when the RDMA FD is dup'ed the RDMA subsystem
does not know about it...  So it was not straight forward to have the RDMA
subsystem do this either.  Not to mention that would be yet another
complication the drivers would have to deal with...  I think you had similar
issues which lead to the use of an "owning_mm" in the umem object.  So while
_some_ mm_struct is held it may not be visible to the user since that mm_struct
may belong to a process which is gone... Or even if not gone, killing it would not
fully remove the pin...

So keeping this tracked against struct file works (and seemed straight forward)
no matter where/how the RDMA FD is shared...  Even with the complication above
I still think it is easier to do this way.

If I am missing something WRT the mm_struct "I'm all ears".

Ira
Ira Weiny Sept. 4, 2019, 10:25 p.m. UTC | #6
On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > 
> > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > that some other thread is) destroying all the MR's we have associated with this
> > > > FD.
> > > 
> > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > deletes any underlying HW resources, but the FD persists.
> > 
> > I misspoke.  I should have said associated with this "context".  And of course
> > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > struct file which had file_pins hanging off of it would be getting its file
> > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > after uverbs_destroy_ufile_hw() is done.
> > 
> > But since it does not block it may be that the struct file is gone before the
> > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > that case...  :-(
> 
> Oh, yes, that is true, you also can't rely on the struct file living
> longer than the HW objects either, that isn't how the lifetime model
> works.

Reviewing all these old threads...  And this made me think.  While the HW
objects may out live the struct file.

They _are_ going away in a finite amount of time right?  It is not like they
could be held forever right?

Ira

> 
> If GUP consumes the struct file it must allow the struct file to be
> deleted before the GUP pin is released.
> 
> > The drivers could provide some generic object (in RDMA this could be the
> > uverbs_attr_bundle) which represents their "context".
> 
> For RDMA the obvious context is the struct ib_mr *
> 
> > But for the procfs interface, that context then needs to be associated with any
> > file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> > would be up to the driver to "install" a procfs handler into any struct file
> > which _may_ point to this context.  (before _or_ after memory pins).
> 
> Is this all just for debugging? Seems like a lot of complication just
> to print a string
> 
> Generally, I think you'd be better to associate things with the
> mm_struct not some struct file... The whole design is simpler as GUP
> already has the mm_struct.
> 
> Jason
diff mbox series

Patch

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 1e5aeb39f774..e802ba8c67d6 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -163,6 +163,7 @@  struct ib_uverbs_file {
 	struct page *disassociate_page;
 
 	struct xarray		idr;
+	struct file             *sys_file; /* backpointer to system file object */
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 11c13c1381cf..002c24e0d4db 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1092,6 +1092,7 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	INIT_LIST_HEAD(&file->umaps);
 
 	filp->private_data = file;
+	file->sys_file = filp;
 	list_add_tail(&file->list, &dev->uverbs_file_list);
 	mutex_unlock(&dev->lists_mutex);
 	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);