diff mbox

[V4] IB/uverbs: Fix race between uverbs_close and remove_one

Message ID 1457795927-16634-1-git-send-email-devesh.sharma@broadcom.com (mailing list archive)
State Superseded
Headers show

Commit Message

Devesh Sharma March 12, 2016, 3:18 p.m. UTC
Fixes: 35d4a0b63dc0 ("IB/uverbs: Fix race between ib_uverbs_open and remove_one")

If "rmmod <vendor-driver>" is done while having rdma applications still running
on a host, the system crashes in the page-fault handler trying to fetch
physical address of an daggling device pointer.

During rmmod every vendor driver must call ib_unregister_device. As part of
this call, IB-stack tries to free-up all the resource associated with the
leaving driver. During the call to ib_uverbs_remove_one, a fatal-event is given
to all the alive rdma applications. The fatal-event causes applications to call
ib_uverbs_close(). Thus, causes two different cleanup context to run in parallel.
In the above scenario, it is possible that ib_uverbs_remove_one() completes and
unblock ib_unregister_device() while ib_uverbs_close() is still waiting for
some of the hardware specific firmware commands to finish. The unblocked
ib_unregister_device() context can actually proceed and free the ib_device
structure. At the same time, in ib_uverbs_close() context the firmware command
may complete and may try to dereference ib_device pointer. But ib_device
pointer is a daggling pointer. Dereference to this pointer causes kernel to
invoke the page_fault handler. It fails to fetch the physical address and
causes kernel panic.

This patch adds two solutions as a remedy:

A) In ib_uverbs_close() context a NULL pointer check on dev->ib_dev pointer is
   added. The check is under a srcu_read_lock. If dev->ib_dev is NULL, the
   check prevents ib_uverbs_close() to enter into ib_uverbs_cleanup_ucontext()
   if ib_uverbs_remove_one has already started. If dev->ib_dev is not NULL,
   ib_uverbs_close() will continue as it is today.

With solution 'A' in place, it is still possible that after reading dev->ib_dev
NULL ib_uverbs_close() context go ahaed and put reference to
ib_uverbs_release_file, even before ib_uverbs_remove_one() reaches to this file
pointer traversing the entire file list one by one. Thus, again to synchronize
these two independent contexts we add solution 'B'

B) If ib_uverbs_close() context reads dev->ib_dev as NULL then, drop the
   srcu_read_lock() and wait for ib_uverbs_remove_one() context to reach
   to the stage where all the resources attached to this file pointer are
   freed. Now, allow ib_uverbs_close() context to put the reference of
   ib_uverbs_release_file. This behaviour is achived with the help of a
   completion signaling.

CC: Yishai Hadas <yishaih@mellanox.com>

Reviewed-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
---
 drivers/infiniband/core/uverbs.h      |  1 +
 drivers/infiniband/core/uverbs_main.c | 16 +++++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

Comments

Jason Gunthorpe March 12, 2016, 8:45 p.m. UTC | #1
On Sat, Mar 12, 2016 at 10:18:47AM -0500, Devesh Sharma wrote:
> CC: Yishai Hadas <yishaih@mellanox.com>

I'm still pretty convinced this is wrong... But even still:

> @@ -954,6 +955,17 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
>  	struct ib_uverbs_file *file = filp->private_data;
>  	struct ib_uverbs_device *dev = file->device;
>  	struct ib_ucontext *ucontext = NULL;
> +	struct ib_device *ib_dev;
> +	int srcu_key;
> +
> +	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
> +	ib_dev = srcu_dereference(dev->ib_dev,
> +				  &dev->disassociate_srcu);
> +	if (!ib_dev) {
> +		srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
> +		wait_for_completion(&file->fcomp);
> +		goto out;

This jumps over this:

>  	if (file->async_file)
>  		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);

Which doesn't seem right.

As I've said, I'm not sure how this is any different from using
lists_mutex. The wait_for_completion will still block and deadlock if
ib_uverbs_close is entered during the disassociate flow.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma March 14, 2016, 5:40 a.m. UTC | #2
On Sun, Mar 13, 2016 at 2:15 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
>
> On Sat, Mar 12, 2016 at 10:18:47AM -0500, Devesh Sharma wrote:
> > CC: Yishai Hadas <yishaih@mellanox.com>
>
> I'm still pretty convinced this is wrong... But even still:
>
> > @@ -954,6 +955,17 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
> >       struct ib_uverbs_file *file = filp->private_data;
> >       struct ib_uverbs_device *dev = file->device;
> >       struct ib_ucontext *ucontext = NULL;
> > +     struct ib_device *ib_dev;
> > +     int srcu_key;
> > +
> > +     srcu_key = srcu_read_lock(&dev->disassociate_srcu);
> > +     ib_dev = srcu_dereference(dev->ib_dev,
> > +                               &dev->disassociate_srcu);
> > +     if (!ib_dev) {
> > +             srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
> > +             wait_for_completion(&file->fcomp);
> > +             goto out;
>
> This jumps over this:

I am not sure, to my mind kref_put(...,ib_uverbs_release_file) should
also be skipped. Because,
by putting an wait_for_completion(), this context is effectively
waiting for ib_uverbs_cleanup_ucontext
to finish cleaning up this file pointer. if the other thread is
cleaning up, why do I need to put the kerf again?

>
> >       if (file->async_file)
> >               kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
>
> Which doesn't seem right.
>
> As I've said, I'm not sure how this is any different from using
> lists_mutex. The wait_for_completion will still block and deadlock if
> ib_uverbs_close is entered during the disassociate flow.

Taking list-mutex is not stopping ib_dev pointer to become NULL, while
if we split the mutex
and wait_for_completion(), then effectively we are trying to sync
ib_uverb_close() and
remove_one() in such a way that no ib_uverb_close context hit ib_dev = NULL

I hope I was able to put the situation in simpler words?

Can you please explain how it can lead to a deadlock?


>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 14, 2016, 5:48 p.m. UTC | #3
On Mon, Mar 14, 2016 at 11:10:33AM +0530, Devesh Sharma wrote:
> On Sun, Mar 13, 2016 at 2:15 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> >
> > On Sat, Mar 12, 2016 at 10:18:47AM -0500, Devesh Sharma wrote:
> > > CC: Yishai Hadas <yishaih@mellanox.com>
> >
> > I'm still pretty convinced this is wrong... But even still:
> >
> > > @@ -954,6 +955,17 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
> > >       struct ib_uverbs_file *file = filp->private_data;
> > >       struct ib_uverbs_device *dev = file->device;
> > >       struct ib_ucontext *ucontext = NULL;
> > > +     struct ib_device *ib_dev;
> > > +     int srcu_key;
> > > +
> > > +     srcu_key = srcu_read_lock(&dev->disassociate_srcu);
> > > +     ib_dev = srcu_dereference(dev->ib_dev,
> > > +                               &dev->disassociate_srcu);
> > > +     if (!ib_dev) {
> > > +             srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
> > > +             wait_for_completion(&file->fcomp);
> > > +             goto out;
> >
> > This jumps over this:
> 
> I am not sure, to my mind kref_put(...,ib_uverbs_release_file) should
> also be skipped.

I am talking about ib_uverbs_release_event_file

> Because, by putting an wait_for_completion(), this context is
> effectively waiting for ib_uverbs_cleanup_ucontext to finish
> cleaning up this file pointer. if the other thread is cleaning up,
> why do I need to put the kerf again?

The other context doesn't do a balancing kref_put(..,ib_uverbs_release_event_file).

> > As I've said, I'm not sure how this is any different from using
> > lists_mutex. The wait_for_completion will still block and deadlock if
> > ib_uverbs_close is entered during the disassociate flow.
> 
> Taking list-mutex is not stopping ib_dev pointer to become NULL, while
> if we split the mutex

I don't think you understand the problem. ib_uverbs_device->ib_dev can
be NULL just fine. In fact, it is always NULL when
ib_uverbs_free_hw_resources calls ib_uverbs_cleanup_ucontext - that is
obviously fine (or if it isn't fine today, there is yet another bug).

The issue appears to be that ib_uverbs_free_hw_resources does not wait
for ib_uverbs_cleanup_ucontext to complete before it goes ahead and
wrecks the ib_dev, resulting in use-after-free/etc on copies of ib_dev
pointer that are used by the still running ib_uverbs_free_hw_resources.

> and wait_for_completion(), then effectively we are trying to sync
> ib_uverb_close() and
> remove_one() in such a way that no ib_uverb_close context hit ib_dev = NULL

No, that is the wrong problem statement and wrong solution.

The solution is to strong fence ib_uverbs_cleanup_ucontext so that
ib_uverbs_free_hw_resources does not exit until it is completed,
either by its thread or in the close thread.

I prefer a mutex, but perhaps there are other ways to build the
fence (eg uverbs_dev->refcount springs to mind)

> Can you please explain how it can lead to a deadlock?

Yishai's note outlines it:

		/* We must release the mutex before going ahead and calling
		 * disassociate_ucontext. disassociate_ucontext might end up
		 * indirectly calling uverbs_close, for example due to freeing
		 * the resources (e.g mmput).

Meaning when we call ib_uverbs_cleanup_ucontext from within
ib_uverbs_free_hw_resources it may recurse down into ib_uverbs_close.

If this happens, with your patch ib_uverbs_close will wait on the
completion in the same thread that is supposed to trigger it. That is
the same deadlock as would happen if the lists_mutex would be used.

The last patch I sent is much closer to what is needed, it is
just completely wrong to try and use the srcu for this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma March 15, 2016, 9 a.m. UTC | #4
On Mon, Mar 14, 2016 at 11:18 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Mar 14, 2016 at 11:10:33AM +0530, Devesh Sharma wrote:
>> On Sun, Mar 13, 2016 at 2:15 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> >
>> > On Sat, Mar 12, 2016 at 10:18:47AM -0500, Devesh Sharma wrote:
>> > > CC: Yishai Hadas <yishaih@mellanox.com>
>> >
>> > I'm still pretty convinced this is wrong... But even still:
>> >
>> > > @@ -954,6 +955,17 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
>> > >       struct ib_uverbs_file *file = filp->private_data;
>> > >       struct ib_uverbs_device *dev = file->device;
>> > >       struct ib_ucontext *ucontext = NULL;
>> > > +     struct ib_device *ib_dev;
>> > > +     int srcu_key;
>> > > +
>> > > +     srcu_key = srcu_read_lock(&dev->disassociate_srcu);
>> > > +     ib_dev = srcu_dereference(dev->ib_dev,
>> > > +                               &dev->disassociate_srcu);
>> > > +     if (!ib_dev) {
>> > > +             srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
>> > > +             wait_for_completion(&file->fcomp);
>> > > +             goto out;
>> >
>> > This jumps over this:
>>
>> I am not sure, to my mind kref_put(...,ib_uverbs_release_file) should
>> also be skipped.
>
> I am talking about ib_uverbs_release_event_file

And I am adding ib_uverbs_release_file() as well. the other thread is already
cleaning it up and if ib_uverbs_close() reads it NULL then why to put krefs?

>
>> Because, by putting an wait_for_completion(), this context is
>> effectively waiting for ib_uverbs_cleanup_ucontext to finish
>> cleaning up this file pointer. if the other thread is cleaning up,
>> why do I need to put the kerf again?
>
> The other context doesn't do a balancing kref_put(..,ib_uverbs_release_event_file).

Actually it does, in the very next while loop on event_file list.

>
>> > As I've said, I'm not sure how this is any different from using
>> > lists_mutex. The wait_for_completion will still block and deadlock if
>> > ib_uverbs_close is entered during the disassociate flow.
>>
>> Taking list-mutex is not stopping ib_dev pointer to become NULL, while
>> if we split the mutex
>
> I don't think you understand the problem. ib_uverbs_device->ib_dev can
> be NULL just fine. In fact, it is always NULL when
> ib_uverbs_free_hw_resources calls ib_uverbs_cleanup_ucontext - that is
> obviously fine (or if it isn't fine today, there is yet another bug).

Okay, I meant it is being freed. before entring
ib_uverb_free_hw_resource() it is obviously set to NULL
which is well understood.

>
> The issue appears to be that ib_uverbs_free_hw_resources does not wait
> for ib_uverbs_cleanup_ucontext to complete before it goes ahead and
> wrecks the ib_dev, resulting in use-after-free/etc on copies of ib_dev
> pointer that are used by the still running ib_uverbs_free_hw_resources.

Exactly, that is the real problem. I have explained it in the description of V4.

>
>> and wait_for_completion(), then effectively we are trying to sync
>> ib_uverb_close() and
>> remove_one() in such a way that no ib_uverb_close context hit ib_dev = NULL
>
> No, that is the wrong problem statement and wrong solution.
>
> The solution is to strong fence ib_uverbs_cleanup_ucontext so that
> ib_uverbs_free_hw_resources does not exit until it is completed,
> either by its thread or in the close thread.
>
> I prefer a mutex, but perhaps there are other ways to build the
> fence (eg uverbs_dev->refcount springs to mind)

uverbs_dev->refcount is already there, we can choose to wait for
ref_count to become zero

static void ib_uverbs_remove_one(struct ib_device *device, void *client_data)
.
.
.
.
.
                rcu_assign_pointer(uverbs_dev->ib_dev, NULL);
                ib_uverbs_free_hw_resources(uverbs_dev, device);
                wait_clients = 0;
        }

wait for zero here instead of dec_and_test, I think things will
automatically fall in place after that

        if (atomic_dec_and_test(&uverbs_dev->refcount))
                ib_uverbs_comp_dev(uverbs_dev);
        if (wait_clients)
                wait_for_completion(&uverbs_dev->comp);
        kobject_put(&uverbs_dev->kobj);
}

>
>> Can you please explain how it can lead to a deadlock?
>
> Yishai's note outlines it:
>
>                 /* We must release the mutex before going ahead and calling
>                  * disassociate_ucontext. disassociate_ucontext might end up
>                  * indirectly calling uverbs_close, for example due to freeing
>                  * the resources (e.g mmput).
>
> Meaning when we call ib_uverbs_cleanup_ucontext from within
> ib_uverbs_free_hw_resources it may recurse down into ib_uverbs_close.
>
> If this happens, with your patch ib_uverbs_close will wait on the
> completion in the same thread that is supposed to trigger it. That is
> the same deadlock as would happen if the lists_mutex would be used.
>
> The last patch I sent is much closer to what is needed, it is
> just completely wrong to try and use the srcu for this.
>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma March 15, 2016, 9:03 a.m. UTC | #5
Sorry, I pressed send too early, just one more point I wanted to make,
listed below

On Tue, Mar 15, 2016 at 2:30 PM, Devesh Sharma
<devesh.sharma@broadcom.com> wrote:
> On Mon, Mar 14, 2016 at 11:18 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
>> On Mon, Mar 14, 2016 at 11:10:33AM +0530, Devesh Sharma wrote:
>>> On Sun, Mar 13, 2016 at 2:15 AM, Jason Gunthorpe
>>> <jgunthorpe@obsidianresearch.com> wrote:
>>> >
>>> > On Sat, Mar 12, 2016 at 10:18:47AM -0500, Devesh Sharma wrote:
>>> > > CC: Yishai Hadas <yishaih@mellanox.com>
>>> >
>>> > I'm still pretty convinced this is wrong... But even still:
>>> >
>>> > > @@ -954,6 +955,17 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
>>> > >       struct ib_uverbs_file *file = filp->private_data;
>>> > >       struct ib_uverbs_device *dev = file->device;
>>> > >       struct ib_ucontext *ucontext = NULL;
>>> > > +     struct ib_device *ib_dev;
>>> > > +     int srcu_key;
>>> > > +
>>> > > +     srcu_key = srcu_read_lock(&dev->disassociate_srcu);
>>> > > +     ib_dev = srcu_dereference(dev->ib_dev,
>>> > > +                               &dev->disassociate_srcu);
>>> > > +     if (!ib_dev) {
>>> > > +             srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
>>> > > +             wait_for_completion(&file->fcomp);
>>> > > +             goto out;
>>> >
>>> > This jumps over this:
>>>
>>> I am not sure, to my mind kref_put(...,ib_uverbs_release_file) should
>>> also be skipped.
>>
>> I am talking about ib_uverbs_release_event_file
>
> And I am adding ib_uverbs_release_file() as well. the other thread is already
> cleaning it up and if ib_uverbs_close() reads it NULL then why to put krefs?
>
>>
>>> Because, by putting an wait_for_completion(), this context is
>>> effectively waiting for ib_uverbs_cleanup_ucontext to finish
>>> cleaning up this file pointer. if the other thread is cleaning up,
>>> why do I need to put the kerf again?
>>
>> The other context doesn't do a balancing kref_put(..,ib_uverbs_release_event_file).
>
> Actually it does, in the very next while loop on event_file list.
>
>>
>>> > As I've said, I'm not sure how this is any different from using
>>> > lists_mutex. The wait_for_completion will still block and deadlock if
>>> > ib_uverbs_close is entered during the disassociate flow.
>>>
>>> Taking list-mutex is not stopping ib_dev pointer to become NULL, while
>>> if we split the mutex
>>
>> I don't think you understand the problem. ib_uverbs_device->ib_dev can
>> be NULL just fine. In fact, it is always NULL when
>> ib_uverbs_free_hw_resources calls ib_uverbs_cleanup_ucontext - that is
>> obviously fine (or if it isn't fine today, there is yet another bug).
>
> Okay, I meant it is being freed. before entring
> ib_uverb_free_hw_resource() it is obviously set to NULL
> which is well understood.
>
>>
>> The issue appears to be that ib_uverbs_free_hw_resources does not wait
>> for ib_uverbs_cleanup_ucontext to complete before it goes ahead and
>> wrecks the ib_dev, resulting in use-after-free/etc on copies of ib_dev
>> pointer that are used by the still running ib_uverbs_free_hw_resources.
>
> Exactly, that is the real problem. I have explained it in the description of V4.
>
>>
>>> and wait_for_completion(), then effectively we are trying to sync
>>> ib_uverb_close() and
>>> remove_one() in such a way that no ib_uverb_close context hit ib_dev = NULL
>>
>> No, that is the wrong problem statement and wrong solution.
>>
>> The solution is to strong fence ib_uverbs_cleanup_ucontext so that
>> ib_uverbs_free_hw_resources does not exit until it is completed,
>> either by its thread or in the close thread.
>>
>> I prefer a mutex, but perhaps there are other ways to build the
>> fence (eg uverbs_dev->refcount springs to mind)
>
> uverbs_dev->refcount is already there, we can choose to wait for
> ref_count to become zero
>
> static void ib_uverbs_remove_one(struct ib_device *device, void *client_data)
> .
> .
> .
> .
> .
>                 rcu_assign_pointer(uverbs_dev->ib_dev, NULL);
>                 ib_uverbs_free_hw_resources(uverbs_dev, device);
>                 wait_clients = 0;
>         }
>
> wait for zero here instead of dec_and_test, I think things will
> automatically fall in place after that

Off-course ib_uverb_close() will have to take the ref-count on entry and
drop on exit.

>
>         if (atomic_dec_and_test(&uverbs_dev->refcount))
>                 ib_uverbs_comp_dev(uverbs_dev);
>         if (wait_clients)
>                 wait_for_completion(&uverbs_dev->comp);
>         kobject_put(&uverbs_dev->kobj);
> }
>
>>
>>> Can you please explain how it can lead to a deadlock?
>>
>> Yishai's note outlines it:
>>
>>                 /* We must release the mutex before going ahead and calling
>>                  * disassociate_ucontext. disassociate_ucontext might end up
>>                  * indirectly calling uverbs_close, for example due to freeing
>>                  * the resources (e.g mmput).
>>
>> Meaning when we call ib_uverbs_cleanup_ucontext from within
>> ib_uverbs_free_hw_resources it may recurse down into ib_uverbs_close.
>>
>> If this happens, with your patch ib_uverbs_close will wait on the
>> completion in the same thread that is supposed to trigger it. That is
>> the same deadlock as would happen if the lists_mutex would be used.
>>
>> The last patch I sent is much closer to what is needed, it is
>> just completely wrong to try and use the srcu for this.
>>
>> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 15, 2016, 8:31 p.m. UTC | #6
On Tue, Mar 15, 2016 at 02:30:43PM +0530, Devesh Sharma wrote:
> >> I am not sure, to my mind kref_put(...,ib_uverbs_release_file) should
> >> also be skipped.
> >
> > I am talking about ib_uverbs_release_event_file
> 
> And I am adding ib_uverbs_release_file() as well. the other thread is already
> cleaning it up and if ib_uverbs_close() reads it NULL then why to put krefs?

> > The other context doesn't do a balancing kref_put(..,ib_uverbs_release_event_file).
> 
> Actually it does, in the very next while loop on event_file list.

I really don't understand your comments.

Maybe you can find some code snippits to explain this.

The krefs in ib_uverbs_event_close are unconditional. If that isn't
right, then there is another bug that needs to be explained. It sure
looks right to me as is.

The patch makes them conditional without any matching change
elsewhere, so it just simply must be wrong.

> > I prefer a mutex, but perhaps there are other ways to build the
> > fence (eg uverbs_dev->refcount springs to mind)
> 
> uverbs_dev->refcount is already there, we can choose to wait for
> ref_count to become zero

> wait for zero here instead of dec_and_test, I think things will
> automatically fall in place after that

More than that would be needed, the refcount is currently always being
held for the full life time of the ib_uverbs_device and the wait is
conditional. You'd have to restructure things so the wait becomes
unconditional and the refcount is conditionally held across the proper
times - eg only during close for the disassociate_ucontext mode.

Note sure this is prettier than a new mutex...

The fundamental thing is that the ib_uverbs_free_hw_resources thread
must wait for ib_uverbs_close's ib_uverbs_cleanup_ucontext to finish,
if necessary, not the other way around.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma March 17, 2016, 4:08 p.m. UTC | #7
On Wed, Mar 16, 2016 at 2:01 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
>
> The patch makes them conditional without any matching change
> elsewhere, so it just simply must be wrong.
>
> > > I prefer a mutex, but perhaps there are other ways to build the
> > > fence (eg uverbs_dev->refcount springs to mind)
> >
> > uverbs_dev->refcount is already there, we can choose to wait for
> > ref_count to become zero
>
> > wait for zero here instead of dec_and_test, I think things will
> > automatically fall in place after that
>
> More than that would be needed, the refcount is currently always being
> held for the full life time of the ib_uverbs_device and the wait is
> conditional. You'd have to restructure things so the wait becomes
> unconditional and the refcount is conditionally held across the proper
> times - eg only during close for the disassociate_ucontext mode.

Yes, I also have pretty much the same understanding.

>
> Note sure this is prettier than a new mutex...

To my mind mutex is *not* solving the problem completely unless we
make it a coarser grained lock. The possible deadlock problem still
lingers around it.

>
> The fundamental thing is that the ib_uverbs_free_hw_resources thread
> must wait for ib_uverbs_close's ib_uverbs_cleanup_ucontext to finish,
> if necessary, not the other way around.

Makes sense.

>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 17, 2016, 4:12 p.m. UTC | #8
On Thu, Mar 17, 2016 at 09:38:30PM +0530, Devesh Sharma wrote:

> To my mind mutex is *not* solving the problem completely unless we
> make it a coarser grained lock. The possible deadlock problem still
> lingers around it.

Review the last version I sent, with this statement in mind:

> > The fundamental thing is that the ib_uverbs_free_hw_resources thread
> > must wait for ib_uverbs_close's ib_uverbs_cleanup_ucontext to finish,
> > if necessary, not the other way around.

It sure doesn't look like it can deadlock...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma March 17, 2016, 4:31 p.m. UTC | #9
On Thu, Mar 17, 2016 at 9:42 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Mar 17, 2016 at 09:38:30PM +0530, Devesh Sharma wrote:
>
>> To my mind mutex is *not* solving the problem completely unless we
>> make it a coarser grained lock. The possible deadlock problem still
>> lingers around it.
>
> Review the last version I sent, with this statement in mind:

I am sorry I lost the track of it, which one you are point..we have
been discussion for a quite some time now!

>
>> > The fundamental thing is that the ib_uverbs_free_hw_resources thread
>> > must wait for ib_uverbs_close's ib_uverbs_cleanup_ucontext to finish,
>> > if necessary, not the other way around.
>
> It sure doesn't look like it can deadlock...
>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 612ccfd..94a7339 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -121,6 +121,7 @@  struct ib_uverbs_file {
 	struct ib_event_handler			event_handler;
 	struct ib_uverbs_event_file	       *async_file;
 	struct list_head			list;
+	struct completion			fcomp;
 	int					is_closed;
 };
 
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 39680ae..da1fed2 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -928,6 +928,7 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	file->async_file = NULL;
 	kref_init(&file->ref);
 	mutex_init(&file->mutex);
+	init_completion(&file->fcomp);
 
 	filp->private_data = file;
 	kobject_get(&dev->kobj);
@@ -954,6 +955,17 @@  static int ib_uverbs_close(struct inode *inode, struct file *filp)
 	struct ib_uverbs_file *file = filp->private_data;
 	struct ib_uverbs_device *dev = file->device;
 	struct ib_ucontext *ucontext = NULL;
+	struct ib_device *ib_dev;
+	int srcu_key;
+
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	ib_dev = srcu_dereference(dev->ib_dev,
+				  &dev->disassociate_srcu);
+	if (!ib_dev) {
+		srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
+		wait_for_completion(&file->fcomp);
+		goto out;
+	}
 
 	mutex_lock(&file->device->lists_mutex);
 	ucontext = file->ucontext;
@@ -965,10 +977,11 @@  static int ib_uverbs_close(struct inode *inode, struct file *filp)
 	mutex_unlock(&file->device->lists_mutex);
 	if (ucontext)
 		ib_uverbs_cleanup_ucontext(file, ucontext);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
 	if (file->async_file)
 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
-
+out:
 	kref_put(&file->ref, ib_uverbs_release_file);
 	kobject_put(&dev->kobj);
 
@@ -1199,6 +1212,7 @@  static void ib_uverbs_free_hw_resources(struct ib_uverbs_device *uverbs_dev,
 		}
 
 		mutex_lock(&uverbs_dev->lists_mutex);
+		complete(&file->fcomp);
 		kref_put(&file->ref, ib_uverbs_release_file);
 	}