diff mbox

[for-next,V5,3/5] IB/uverbs: Enable device removal when there are active user space applications

Message ID 1434984438-21733-4-git-send-email-yishaih@mellanox.com (mailing list archive)
State Superseded
Headers show

Commit Message

Yishai Hadas June 22, 2015, 2:47 p.m. UTC
Enables the uverbs_remove_one to succeed despite the fact that there are
running IB applications working with the given ib device.  This functionality
enables a HW device to be unbind/reset despite the fact that there are running
user space applications using it.

It exposes a new IB kernel API named 'disassociate_ucontext' which lets a
driver detaching its HW resources from a given user context without
crashing/terminating the application. In case a driver implemented the above
API and registered with ib_uverb there will be no dependency between its device
to its uverbs_device. Upon calling remove_one of ib_uverbs the call should
return after disassociating the open HW resources without waiting to clients
disconnecting. In case driver didn't implement this API there will be no change
to current behaviour and uverbs_remove_one will return only when last client
has disconnected and reference count on uverbs device became 0.

In case the lower driver device was removed any application will continue
working over some zombie HCA, further calls will ended with an immediate error.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
---
 drivers/infiniband/core/uverbs.h      |    9 +-
 drivers/infiniband/core/uverbs_main.c |  370 ++++++++++++++++++++++++++------
 include/rdma/ib_verbs.h               |    1 +
 3 files changed, 309 insertions(+), 71 deletions(-)

Comments

Jason Gunthorpe June 24, 2015, 6:25 p.m. UTC | #1
On Mon, Jun 22, 2015 at 05:47:16PM +0300, Yishai Hadas wrote:
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -137,7 +137,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
>  	struct ib_uverbs_device *dev =
>  		container_of(ref, struct ib_uverbs_device, ref);
>  
> -	complete(&dev->comp);
> +	if (!dev->ib_dev) {
> +		cleanup_srcu_struct(&dev->disassociate_srcu);
> +		kfree(dev);
> +	} else {
> +		complete(&dev->comp);
> +	}

Oy.. It is so gross to see a kref now being simultaneously used for
actual memory accounting and also for general reference counting.

It is also locked wrong, for instance:

@@ -792,13 +889,18 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 err:
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 	kref_put(&dev->ref, ib_uverbs_release_dev);

Is not holding the RCU lock while ib_uverbs_release_dev is reading
ib_dev. The barriers in kref are not strong enough to guarentee the RCU
protected data will be visible. (remember when I asked if you checked
all of these?)

Obviously you can't hold the disassociate_srcu and call kref_put, so
maybe grabbing it in release_dev would work. I didn't look that
closely.

But really, don't make a kref do two kinds of things, it just doesn't
make any sense. You should split it into a proper memory ownership
tracking kref that always does kfree and a counter used to cause
complete().

The rest looked OK now..

> +                       /* The barriers built into wait_event_interruptible()
> +                        * and wake_up() make this ib_dev check RCU protected
> +                        */

No..

 The barriers built into wait_event_interruptible() and wake_up()
 guarentee this will see the null set without using RCU

> +	if (device->disassociate_ucontext) {
> +		/* We disassociate HW resources and immediately returning, not
> +		 * pending to active userspace clients. Upon returning ib_device
> +		 * may be freed internally and is not valid any more.
> +		 * uverbs_device is still available, when all clients close
> +		 * their files, the uverbs device ref count will be zero and its
> +		 * resources will be freed.
> +		 * Note: At that step no more files can be opened on that cdev
> +		 * as it was deleted, however active clients can still issue
> +		 * commands and close their open files.
> +		 */

Clean up the grammer..

We disassociate HW resources and immediately return. Userspace will
see a EIO errno for all future access. Upon returning, ib_device may be
freed internally and is not valid any more.  uverbs_device is still
available until all clients close their files, then the uverbs device
ref count will be zero and its resources will be freed.  Note: At this
point no more files can be opened since the cdev was deleted, however
active clients can still issue commands and close their open files.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yishai Hadas June 25, 2015, 1:51 p.m. UTC | #2
On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
> On Mon, Jun 22, 2015 at 05:47:16PM +0300, Yishai Hadas wrote:
>> +++ b/drivers/infiniband/core/uverbs_main.c
>> @@ -137,7 +137,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
>>   	struct ib_uverbs_device *dev =
>>   		container_of(ref, struct ib_uverbs_device, ref);
>>
>> -	complete(&dev->comp);
>> +	if (!dev->ib_dev) {
>> +		cleanup_srcu_struct(&dev->disassociate_srcu);
>> +		kfree(dev);
>> +	} else {
>> +		complete(&dev->comp);
>> +	}
>
> Oy.. It is so gross to see a kref now being simultaneously used for
> actual memory accounting and also for general reference counting.
>
> It is also locked wrong, for instance:
Not correct, see below.

>
> @@ -792,13 +889,18 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
>   err:
> +	mutex_unlock(&dev->lists_mutex);
> +	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
>   	kref_put(&dev->ref, ib_uverbs_release_dev);
>
> Is not holding the RCU lock while ib_uverbs_release_dev is reading
> ib_dev. The barriers in kref are not strong enough to guarentee the RCU
> protected data will be visible. (remember when I asked if you checked
> all of these?)
>

This is not a locking problem, the only option that here the reference 
count becomes 0 is if ib_uverbs_remove_one was previously called and 
decreased the reference count that was taken upon load. However, it was 
done after that rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called 
so the check whether if (!dev->ib_dev) is fully protected and can't race 
with HW removal flow.


> Obviously you can't hold the disassociate_srcu and call kref_put, so
> maybe grabbing it in release_dev would work. I didn't look that
> closely.
>
> But really, don't make a kref do two kinds of things, it just doesn't
> make any sense. You should split it into a proper memory ownership
> tracking kref that always does kfree and a counter used to cause
> complete().

The kref is used to manage the uverbs_dev allocation, the internal code 
in ib_uverbs_release_dev depends on the state. Usually the natural place 
to free the memory is as part of the release function as done in other 
kernel places. In case that ib_device was previously removed it can be 
safely freed here as it's called when the last client disconnected, this 
logic is introduced by this patch. In case there is a need to wait 
clients as the driver doesn't support HW device removal the free can't 
be done internally but must be done externally in ib_uverbs_remove_one 
when last client disconnected and complete should be used instead. As of 
I believe that we can stay with only one kref that manage the uverbs_dev 
which is safe as I pointed above.

>
> The rest looked OK now..
Thanks

>
>> +                       /* The barriers built into wait_event_interruptible()
>> +                        * and wake_up() make this ib_dev check RCU protected
>> +                        */
>
> No..
>
>   The barriers built into wait_event_interruptible() and wake_up()
>   guarentee this will see the null set without using RCU
>
Will fix the comment

>> +	if (device->disassociate_ucontext) {
>> +		/* We disassociate HW resources and immediately returning, not
>> +		 * pending to active userspace clients. Upon returning ib_device
>> +		 * may be freed internally and is not valid any more.
>> +		 * uverbs_device is still available, when all clients close
>> +		 * their files, the uverbs device ref count will be zero and its
>> +		 * resources will be freed.
>> +		 * Note: At that step no more files can be opened on that cdev
>> +		 * as it was deleted, however active clients can still issue
>> +		 * commands and close their open files.
>> +		 */
>
> Clean up the grammer..
>
> We disassociate HW resources and immediately return. Userspace will
> see a EIO errno for all future access. Upon returning, ib_device may be
> freed internally and is not valid any more.  uverbs_device is still
> available until all clients close their files, then the uverbs device
> ref count will be zero and its resources will be freed.  Note: At this
> point no more files can be opened since the cdev was deleted, however
> active clients can still issue commands and close their open files.

OK
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe June 25, 2015, 5:09 p.m. UTC | #3
On Thu, Jun 25, 2015 at 04:51:49PM +0300, Yishai Hadas wrote:
> On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
> >Is not holding the RCU lock while ib_uverbs_release_dev is reading
> >ib_dev. The barriers in kref are not strong enough to guarentee the RCU
> >protected data will be visible. (remember when I asked if you checked
> >all of these?)
> 
> This is not a locking problem, the only option that here the reference count
> becomes 0 is if ib_uverbs_remove_one was previously called and decreased the
> reference count that was taken upon load. However, it was done after that
> rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called so the check whether
> if (!dev->ib_dev) is fully protected and can't race with HW removal flow.

The problem with RCU is not to do with instruction concurrancy, you
need to convince yourself the unlocked accesses have *data* coherency,
that is what rcu_derference and friends are all about.

So, looked too briefly yesterday (sorry, busy), but it turns out that
atomic_sub_return guarentees a full smp_mb(), so the data barrier in
kref_put *IS* strong enough for this use. Thus it is OK.

> The kref is used to manage the uverbs_dev allocation, the internal code in
> ib_uverbs_release_dev depends on the state. Usually the natural place to
> free the memory is as part of the release function as done in other kernel
> places. In case that ib_device was previously removed it can be safely freed
> here as it's called when the last client disconnected, this logic is
> introduced by this patch. In case there is a need to wait clients as the
> driver doesn't support HW device removal the free can't be done internally
> but must be done externally in ib_uverbs_remove_one when last client
> disconnected and complete should be used instead. As of I believe that we
> can stay with only one kref that manage the uverbs_dev which is safe as I
> pointed above.

Many other places in the kernel split the 'can i complete remove'
counter (ie the active count) from the kref. It just makes things
simpler and cleaner

When you changed the flow to do both things, it breaks the error
handling, ie this:

@@ -924,6 +1062,13 @@  static void ib_uverbs_add_one(struct ib_device *device)
	uverbs_dev = kzalloc(sizeof *uverbs_dev, GFP_KERNEL);
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	mutex_init(&uverbs_dev->lists_mutex);
+	ret = init_srcu_struct(&uverbs_dev->disassociate_srcu);
+	if (ret)
+		goto err_init;

+err_init:
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
 	wait_for_completion(&uverbs_dev->comp);
 	kfree(uverbs_dev);

Is now a double kfree.

So, to fix this, we need to have two different uses of kput, one that
is protected by the implicit locking of add_one/remove_one, and one
that is usable otherwise.

The former should always be the same, and wrapped in a function:

// ONLY callable from remove_one/add_one
static void free_uverbs_dev(..)
{
	bool need_wait = uverbs_dev->ib_dev;
	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
	if (need_wait) {
		wait_for_completion(&uverbs_dev->comp);
 		cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
		kfree(uverbs_dev);
	}
}

Workable, but it sure makes the kref weird and ugly and abnormal.

99% of the time, seeing a dereference after a kref would be an error,
only when a kref is pretending to be an atomic does it matter. I am of
the opinion that abusing kref for something that is not memory
reference counting is wrong, it makes the code hard to audit and
understand because everyone *expects* standard rules to be followed
with kref.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yishai Hadas June 29, 2015, 9:33 p.m. UTC | #4
On 6/25/2015 8:09 PM, Jason Gunthorpe wrote:
> On Thu, Jun 25, 2015 at 04:51:49PM +0300, Yishai Hadas wrote:
>> On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
>>> Is not holding the RCU lock while ib_uverbs_release_dev is reading
>>> ib_dev. The barriers in kref are not strong enough to guarentee the RCU
>>> protected data will be visible. (remember when I asked if you checked
>>> all of these?)
>>
>> This is not a locking problem, the only option that here the reference count
>> becomes 0 is if ib_uverbs_remove_one was previously called and decreased the
>> reference count that was taken upon load. However, it was done after that
>> rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called so the check whether
>> if (!dev->ib_dev) is fully protected and can't race with HW removal flow.
>
> The problem with RCU is not to do with instruction concurrancy, you
> need to convince yourself the unlocked accesses have *data* coherency,
> that is what rcu_derference and friends are all about.
>
> So, looked too briefly yesterday (sorry, busy), but it turns out that
> atomic_sub_return guarentees a full smp_mb(), so the data barrier in
> kref_put *IS* strong enough for this use. Thus it is OK.
>
>> The kref is used to manage the uverbs_dev allocation, the internal code in
>> ib_uverbs_release_dev depends on the state. Usually the natural place to
>> free the memory is as part of the release function as done in other kernel
>> places. In case that ib_device was previously removed it can be safely freed
>> here as it's called when the last client disconnected, this logic is
>> introduced by this patch. In case there is a need to wait clients as the
>> driver doesn't support HW device removal the free can't be done internally
>> but must be done externally in ib_uverbs_remove_one when last client
>> disconnected and complete should be used instead. As of I believe that we
>> can stay with only one kref that manage the uverbs_dev which is safe as I
>> pointed above.
>
> Many other places in the kernel split the 'can i complete remove'
> counter (ie the active count) from the kref. It just makes things
> simpler and cleaner

OK, will use 2 krefs one will control the complete and the second will 
manage the memory, V6 will have this change in.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index dc002a1..06c80ba 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -89,11 +89,15 @@  struct ib_uverbs_device {
 	int					num_comp_vectors;
 	struct completion			comp;
 	struct device			       *dev;
-	struct ib_device		       *ib_dev;
+	struct ib_device	__rcu	       *ib_dev;
 	int					devnum;
 	struct cdev			        cdev;
 	struct rb_root				xrcd_tree;
 	struct mutex				xrcd_tree_mutex;
+	struct srcu_struct			disassociate_srcu;
+	struct mutex				lists_mutex; /* protect lists */
+	struct list_head			uverbs_file_list;
+	struct list_head			uverbs_events_file_list;
 };
 
 struct ib_uverbs_event_file {
@@ -105,6 +109,7 @@  struct ib_uverbs_event_file {
 	wait_queue_head_t			poll_wait;
 	struct fasync_struct		       *async_queue;
 	struct list_head			event_list;
+	struct list_head			list;
 };
 
 struct ib_uverbs_file {
@@ -114,6 +119,8 @@  struct ib_uverbs_file {
 	struct ib_ucontext		       *ucontext;
 	struct ib_event_handler			event_handler;
 	struct ib_uverbs_event_file	       *async_file;
+	struct list_head			list;
+	int					is_closed;
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 6dea148..02a9b82 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -137,7 +137,12 @@  static void ib_uverbs_release_dev(struct kref *ref)
 	struct ib_uverbs_device *dev =
 		container_of(ref, struct ib_uverbs_device, ref);
 
-	complete(&dev->comp);
+	if (!dev->ib_dev) {
+		cleanup_srcu_struct(&dev->disassociate_srcu);
+		kfree(dev);
+	} else {
+		complete(&dev->comp);
+	}
 }
 
 static void ib_uverbs_release_event_file(struct kref *ref)
@@ -203,9 +208,6 @@  static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 {
 	struct ib_uobject *uobj, *tmp;
 
-	if (!context)
-		return 0;
-
 	context->closing = 1;
 
 	list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) {
@@ -309,8 +311,15 @@  static void ib_uverbs_release_file(struct kref *ref)
 {
 	struct ib_uverbs_file *file =
 		container_of(ref, struct ib_uverbs_file, ref);
-
-	module_put(file->device->ib_dev->owner);
+	struct ib_device *ib_dev;
+	int srcu_key;
+
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (ib_dev && !ib_dev->disassociate_ucontext)
+		module_put(ib_dev->owner);
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
 	kref_put(&file->device->ref, ib_uverbs_release_dev);
 
 	kfree(file);
@@ -333,9 +342,18 @@  static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf,
 			return -EAGAIN;
 
 		if (wait_event_interruptible(file->poll_wait,
-					     !list_empty(&file->event_list)))
+					     (!list_empty(&file->event_list) ||
+			/* The barriers built into wait_event_interruptible()
+			 * and wake_up() make this ib_dev check RCU protected
+			 */
+					     !file->uverbs_file->device->ib_dev)))
 			return -ERESTARTSYS;
 
+		/* If device was disassociated and no event exists set an error */
+		if (list_empty(&file->event_list) &&
+		    !file->uverbs_file->device->ib_dev)
+			return -EIO;
+
 		spin_lock_irq(&file->lock);
 	}
 
@@ -398,8 +416,11 @@  static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_event_file *file = filp->private_data;
 	struct ib_uverbs_event *entry, *tmp;
+	int closed_already = 0;
 
+	mutex_lock(&file->uverbs_file->device->lists_mutex);
 	spin_lock_irq(&file->lock);
+	closed_already = file->is_closed;
 	file->is_closed = 1;
 	list_for_each_entry_safe(entry, tmp, &file->event_list, list) {
 		if (entry->counter)
@@ -407,9 +428,14 @@  static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 		kfree(entry);
 	}
 	spin_unlock_irq(&file->lock);
+	if (!closed_already) {
+		list_del(&file->list);
+		if (file->is_async)
+			ib_unregister_event_handler(&file->uverbs_file->
+				event_handler);
+	}
+	mutex_unlock(&file->uverbs_file->device->lists_mutex);
 
-	if (file->is_async)
-		ib_unregister_event_handler(&file->uverbs_file->event_handler);
 	kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
 	kref_put(&file->ref, ib_uverbs_release_event_file);
 
@@ -574,6 +600,11 @@  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 		goto err;
 
 	kref_get(&uverbs_file->ref);
+	mutex_lock(&uverbs_file->device->lists_mutex);
+	list_add_tail(&ev_file->list,
+		      &uverbs_file->device->uverbs_events_file_list);
+	mutex_unlock(&uverbs_file->device->lists_mutex);
+
 	if (is_async) {
 		WARN_ON(uverbs_file->async_file);
 		uverbs_file->async_file = ev_file;
@@ -633,12 +664,11 @@  static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 			     size_t count, loff_t *pos)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-	struct ib_device *ib_dev = file->device->ib_dev;
+	struct ib_device *ib_dev;
 	struct ib_uverbs_cmd_hdr hdr;
 	__u32 flags;
-
-	if (!ib_dev)
-		return -ENODEV;
+	int srcu_key;
+	ssize_t ret;
 
 	if (count < sizeof hdr)
 		return -EINVAL;
@@ -646,6 +676,14 @@  static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (copy_from_user(&hdr, buf, sizeof hdr))
 		return -EFAULT;
 
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
+		goto out;
+	}
+
 	flags = (hdr.command &
 		 IB_USER_VERBS_CMD_FLAGS_MASK) >> IB_USER_VERBS_CMD_FLAGS_SHIFT;
 
@@ -653,26 +691,36 @@  static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		__u32 command;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_cmd_table) ||
-		    !uverbs_cmd_table[command])
-			return -EINVAL;
+		    !uverbs_cmd_table[command]) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (!file->ucontext &&
-		    command != IB_USER_VERBS_CMD_GET_CONTEXT)
-			return -EINVAL;
+		    command != IB_USER_VERBS_CMD_GET_CONTEXT) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(ib_dev->uverbs_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(ib_dev->uverbs_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (hdr.in_words * 4 != count)
-			return -EINVAL;
+		if (hdr.in_words * 4 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		return uverbs_cmd_table[command](file, ib_dev,
+		ret = uverbs_cmd_table[command](file, ib_dev,
 						 buf + sizeof(hdr),
 						 hdr.in_words * 4,
 						 hdr.out_words * 4);
@@ -683,51 +731,72 @@  static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		struct ib_uverbs_ex_cmd_hdr ex_hdr;
 		struct ib_udata ucore;
 		struct ib_udata uhw;
-		int err;
 		size_t written_count = count;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_ex_cmd_table) ||
-		    !uverbs_ex_cmd_table[command])
-			return -ENOSYS;
+		    !uverbs_ex_cmd_table[command]) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (!file->ucontext)
-			return -EINVAL;
+		if (!file->ucontext) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(ib_dev->uverbs_ex_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(ib_dev->uverbs_ex_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (count < (sizeof(hdr) + sizeof(ex_hdr)))
-			return -EINVAL;
+		if (count < (sizeof(hdr) + sizeof(ex_hdr))) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr)))
-			return -EFAULT;
+		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr))) {
+			ret = -EFAULT;
+			goto out;
+		}
 
 		count -= sizeof(hdr) + sizeof(ex_hdr);
 		buf += sizeof(hdr) + sizeof(ex_hdr);
 
-		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count)
-			return -EINVAL;
+		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (ex_hdr.cmd_hdr_reserved)
-			return -EINVAL;
+		if (ex_hdr.cmd_hdr_reserved) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (ex_hdr.response) {
-			if (!hdr.out_words && !ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (!hdr.out_words && !ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 
 			if (!access_ok(VERIFY_WRITE,
 				       (void __user *) (unsigned long) ex_hdr.response,
-				       (hdr.out_words + ex_hdr.provider_out_words) * 8))
-				return -EFAULT;
+				       (hdr.out_words + ex_hdr.provider_out_words) * 8)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		} else {
-			if (hdr.out_words || ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (hdr.out_words || ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 		}
 
 		INIT_UDATA_BUF_OR_NULL(&ucore, buf, (unsigned long) ex_hdr.response,
@@ -739,29 +808,43 @@  static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 				       ex_hdr.provider_in_words * 8,
 				       ex_hdr.provider_out_words * 8);
 
-		err = uverbs_ex_cmd_table[command](file,
+		ret = uverbs_ex_cmd_table[command](file,
 						   ib_dev,
 						   &ucore,
 						   &uhw);
-
-		if (err)
-			return err;
-
-		return written_count;
+		if (!ret)
+			ret = written_count;
+	} else {
+		ret = -ENOSYS;
 	}
 
-	return -ENOSYS;
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-	struct ib_device *ib_dev = file->device->ib_dev;
+	struct ib_device *ib_dev;
+	int ret = 0;
+	int srcu_key;
 
-	if (!ib_dev || !file->ucontext)
-		return -ENODEV;
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (!file->ucontext)
+		ret = -ENODEV;
 	else
-		return ib_dev->mmap(file->ucontext, vma);
+		ret = ib_dev->mmap(file->ucontext, vma);
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 /*
@@ -778,7 +861,10 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_device *dev;
 	struct ib_uverbs_file *file;
+	struct ib_device *ib_dev;
 	int ret;
+	int module_dependent;
+	int srcu_key;
 
 	dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev);
 	if (dev)
@@ -786,15 +872,34 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	else
 		return -ENXIO;
 
-	if (!try_module_get(dev->ib_dev->owner)) {
-		ret = -ENODEV;
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	mutex_lock(&dev->lists_mutex);
+	ib_dev = srcu_dereference(dev->ib_dev,
+				  &dev->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
 		goto err;
 	}
 
-	file = kmalloc(sizeof *file, GFP_KERNEL);
+	/* In case IB device supports disassociate ucontext, there is no hard
+	 * dependency between uverbs device and its low level device.
+	 */
+	module_dependent = !(ib_dev->disassociate_ucontext);
+
+	if (module_dependent) {
+		if (!try_module_get(ib_dev->owner)) {
+			ret = -ENODEV;
+			goto err;
+		}
+	}
+
+	file = kzalloc(sizeof(*file), GFP_KERNEL);
 	if (!file) {
 		ret = -ENOMEM;
-		goto err_module;
+		if (module_dependent)
+			goto err_module;
+
+		goto err;
 	}
 
 	file->device	 = dev;
@@ -804,13 +909,18 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	mutex_init(&file->mutex);
 
 	filp->private_data = file;
+	list_add_tail(&file->list, &dev->uverbs_file_list);
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
 	return nonseekable_open(inode, filp);
 
 err_module:
-	module_put(dev->ib_dev->owner);
+	module_put(ib_dev->owner);
 
 err:
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 	kref_put(&dev->ref, ib_uverbs_release_dev);
 	return ret;
 }
@@ -818,8 +928,18 @@  err:
 static int ib_uverbs_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-
-	ib_uverbs_cleanup_ucontext(file, file->ucontext);
+	struct ib_ucontext *ucontext = NULL;
+
+	mutex_lock(&file->device->lists_mutex);
+	ucontext = file->ucontext;
+	file->ucontext = NULL;
+	if (!file->is_closed) {
+		list_del(&file->list);
+		file->is_closed = 1;
+	}
+	mutex_unlock(&file->device->lists_mutex);
+	if (ucontext)
+		ib_uverbs_cleanup_ucontext(file, ucontext);
 
 	if (file->async_file)
 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
@@ -855,12 +975,21 @@  static struct ib_client uverbs_client = {
 static ssize_t show_ibdev(struct device *device, struct device_attribute *attr,
 			  char *buf)
 {
+	int ret = -ENODEV;
+	int srcu_key;
 	struct ib_uverbs_device *dev = dev_get_drvdata(device);
+	struct ib_device *ib_dev;
 
 	if (!dev)
 		return -ENODEV;
 
-	return sprintf(buf, "%s\n", dev->ib_dev->name);
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	ib_dev = srcu_dereference(dev->ib_dev, &dev->disassociate_srcu);
+	if (ib_dev)
+		ret = sprintf(buf, "%s\n", ib_dev->name);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
+
+	return ret;
 }
 static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL);
 
@@ -868,11 +997,19 @@  static ssize_t show_dev_abi_version(struct device *device,
 				    struct device_attribute *attr, char *buf)
 {
 	struct ib_uverbs_device *dev = dev_get_drvdata(device);
+	int ret = -ENODEV;
+	int srcu_key;
+	struct ib_device *ib_dev;
 
 	if (!dev)
 		return -ENODEV;
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	ib_dev = srcu_dereference(dev->ib_dev, &dev->disassociate_srcu);
+	if (ib_dev)
+		ret = sprintf(buf, "%d\n", ib_dev->uverbs_abi_ver);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
-	return sprintf(buf, "%d\n", dev->ib_dev->uverbs_abi_ver);
+	return ret;
 }
 static DEVICE_ATTR(abi_version, S_IRUGO, show_dev_abi_version, NULL);
 
@@ -912,6 +1049,7 @@  static void ib_uverbs_add_one(struct ib_device *device)
 	int devnum;
 	dev_t base;
 	struct ib_uverbs_device *uverbs_dev;
+	int ret;
 
 	if (!device->alloc_ucontext)
 		return;
@@ -924,6 +1062,13 @@  static void ib_uverbs_add_one(struct ib_device *device)
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	mutex_init(&uverbs_dev->lists_mutex);
+	ret = init_srcu_struct(&uverbs_dev->disassociate_srcu);
+	if (ret)
+		goto err_init;
+
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_file_list);
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_events_file_list);
 
 	spin_lock(&map_lock);
 	devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES);
@@ -944,7 +1089,7 @@  static void ib_uverbs_add_one(struct ib_device *device)
 	}
 	spin_unlock(&map_lock);
 
-	uverbs_dev->ib_dev           = device;
+	rcu_assign_pointer(uverbs_dev->ib_dev, device);
 	uverbs_dev->num_comp_vectors = device->num_comp_vectors;
 
 	cdev_init(&uverbs_dev->cdev, NULL);
@@ -980,15 +1125,81 @@  err_cdev:
 		clear_bit(devnum, overflow_map);
 
 err:
+	cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+
+err_init:
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
 	wait_for_completion(&uverbs_dev->comp);
 	kfree(uverbs_dev);
 	return;
 }
 
+static void ib_uverbs_free_hw_resources(struct ib_uverbs_device *uverbs_dev,
+					struct ib_device *ib_dev)
+{
+	struct ib_uverbs_file *file;
+	struct ib_uverbs_event_file *event_file;
+	struct ib_event event;
+
+	/* Pending running commands to terminate */
+	synchronize_srcu(&uverbs_dev->disassociate_srcu);
+	event.event = IB_EVENT_DEVICE_FATAL;
+	event.element.port_num = 0;
+	event.device = ib_dev;
+
+	mutex_lock(&uverbs_dev->lists_mutex);
+	while (!list_empty(&uverbs_dev->uverbs_file_list)) {
+		struct ib_ucontext *ucontext;
+
+		file = list_first_entry(&uverbs_dev->uverbs_file_list,
+					struct ib_uverbs_file, list);
+		file->is_closed = 1;
+		ucontext = file->ucontext;
+		list_del(&file->list);
+		file->ucontext = NULL;
+		kref_get(&file->ref);
+		mutex_unlock(&uverbs_dev->lists_mutex);
+		/* We must release the mutex before going ahead and calling
+		 * disassociate_ucontext. disassociate_ucontext might end up
+		 * indirectly calling uverbs_close, for example due to freeing
+		 * the resources (e.g mmput).
+		 */
+		ib_uverbs_event_handler(&file->event_handler, &event);
+		if (ucontext) {
+			ib_dev->disassociate_ucontext(ucontext);
+			ib_uverbs_cleanup_ucontext(file, ucontext);
+		}
+
+		mutex_lock(&uverbs_dev->lists_mutex);
+		kref_put(&file->ref, ib_uverbs_release_file);
+	}
+
+	while (!list_empty(&uverbs_dev->uverbs_events_file_list)) {
+		event_file = list_first_entry(&uverbs_dev->
+					      uverbs_events_file_list,
+					      struct ib_uverbs_event_file,
+					      list);
+		spin_lock_irq(&event_file->lock);
+		event_file->is_closed = 1;
+		spin_unlock_irq(&event_file->lock);
+
+		list_del(&event_file->list);
+		if (event_file->is_async) {
+			ib_unregister_event_handler(&event_file->uverbs_file->
+						    event_handler);
+			event_file->uverbs_file->event_handler.device = NULL;
+		}
+
+		wake_up_interruptible(&event_file->poll_wait);
+		kill_fasync(&event_file->async_queue, SIGIO, POLL_IN);
+	}
+	mutex_unlock(&uverbs_dev->lists_mutex);
+}
+
 static void ib_uverbs_remove_one(struct ib_device *device)
 {
 	struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client);
+	int wait_clients = 1;
 
 	if (!uverbs_dev)
 		return;
@@ -1002,9 +1213,28 @@  static void ib_uverbs_remove_one(struct ib_device *device)
 	else
 		clear_bit(uverbs_dev->devnum - IB_UVERBS_MAX_DEVICES, overflow_map);
 
+	if (device->disassociate_ucontext) {
+		/* We disassociate HW resources and immediately returning, not
+		 * pending to active userspace clients. Upon returning ib_device
+		 * may be freed internally and is not valid any more.
+		 * uverbs_device is still available, when all clients close
+		 * their files, the uverbs device ref count will be zero and its
+		 * resources will be freed.
+		 * Note: At that step no more files can be opened on that cdev
+		 * as it was deleted, however active clients can still issue
+		 * commands and close their open files.
+		 */
+		rcu_assign_pointer(uverbs_dev->ib_dev, NULL);
+		ib_uverbs_free_hw_resources(uverbs_dev, device);
+		wait_clients = 0;
+	}
+	/* ref count taken as part of add one is put back in both modes. */
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
-	wait_for_completion(&uverbs_dev->comp);
-	kfree(uverbs_dev);
+	if (wait_clients) {
+		wait_for_completion(&uverbs_dev->comp);
+		cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+		kfree(uverbs_dev);
+	}
 }
 
 static char *uverbs_devnode(struct device *dev, umode_t *mode)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 986fddb..e30552f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1724,6 +1724,7 @@  struct ib_device {
 	int			   (*destroy_flow)(struct ib_flow *flow_id);
 	int			   (*check_mr_status)(struct ib_mr *mr, u32 check_mask,
 						      struct ib_mr_status *mr_status);
+	void			   (*disassociate_ucontext)(struct ib_ucontext *ibcontext);
 
 	struct ib_dma_mapping_ops   *dma_ops;