Message ID | 1438697008-26209-7-git-send-email-yishaih@mellanox.com (mailing list archive)
---|---
State | Superseded
On Tue, Aug 04, 2015 at 05:03:28PM +0300, Yishai Hadas wrote:
> Currently, the IB/cma remove_one flow blocks until all user descriptors
> managed by IB/ucma are released. This prevents hot-removal of IB devices.
> This patch allows IB/cma to remove devices regardless of user space
> activity. Upon getting the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all
> the underlying HW resources for the given ucontext. The ucontext itself
> stays alive until it is explicitly destroyed by its creator.
>
> Running applications at that time will have a zombie device; further
> operations may fail.

I've never made it far enough to look at this patch before... It is
Sean's stuff, so he is best qualified to comment on it. The rest of the
series is OK now, Sean.

> +static void ucma_close_id(struct work_struct *work)
> +{
> +	struct ucma_context *ctx = container_of(work, struct ucma_context, close_work);
> +
> +	/* Fence to ensure that ctx->closing was seen by all
> +	 * ucma_get_ctx running
> +	 */
> +	mutex_lock(&mut);
> +	mutex_unlock(&mut);

Uh, no. That isn't how to use mutexes. I don't really see any
consistent locking scheme for closing.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 04, 2015 at 05:03:28PM +0300, Yishai Hadas wrote:
> Currently, the IB/cma remove_one flow blocks until all user descriptors
> managed by IB/ucma are released. This prevents hot-removal of IB devices.
> This patch allows IB/cma to remove devices regardless of user space
> activity. Upon getting the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all
> the underlying HW resources for the given ucontext. The ucontext itself
> stays alive until it is explicitly destroyed by its creator.

Implementation aside, this changes the policy of the ucma from

  "Tell user space and expect it to synchronously clean up"

to

  "Tell user space we already nuked the RDMA device, asynchronously."

Do we even want to do that unconditionally? Shouldn't the kernel at
least give userspace some time to respond to the event before nuking
the world?

I'm now also wondering if we have some ordering issues on removal. Will
uverbs continue to work while ucma is active? It looks like at least we
can rmmod ib_uverbs, and then things go really sideways for poor apps.
That is probably a corner case, though.

Jason
On 8/5/2015 1:09 AM, Jason Gunthorpe wrote:
> On Tue, Aug 04, 2015 at 05:03:28PM +0300, Yishai Hadas wrote:
>> Currently, the IB/cma remove_one flow blocks until all user descriptors
>> managed by IB/ucma are released. This prevents hot-removal of IB devices.
>> This patch allows IB/cma to remove devices regardless of user space
>> activity. Upon getting the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all
>> the underlying HW resources for the given ucontext. The ucontext itself
>> stays alive until it is explicitly destroyed by its creator.
>>
>> Running applications at that time will have a zombie device; further
>> operations may fail.
>
> I've never made it far enough to look at this patch before... It is
> Sean's stuff, so he is best qualified to comment on it. The rest of the
> series is OK now, Sean.
>
>> +static void ucma_close_id(struct work_struct *work)
>> +{
>> +	struct ucma_context *ctx = container_of(work, struct ucma_context, close_work);
>> +
>> +	/* Fence to ensure that ctx->closing was seen by all
>> +	 * ucma_get_ctx running
>> +	 */
>> +	mutex_lock(&mut);
>> +	mutex_unlock(&mut);
>
> Uh, no. That isn't how to use mutexes.

That locking scheme is used in some other places in the kernel when a
barrier/fence is needed; see below for some examples of that usage and my
comments around it. One of them (in cma.c, see below) was added by Sean
Hefty himself as part of commit a396d43a, in rdma_destroy_id; it is part
of the same kernel subsystem and uses the same concept as the code above.

I believe we can go ahead and merge this patch as part of the series. If
you still don't agree, we can merge the first five patches, which don't
depend on this last one, and handle/discuss this patch later. The first
four you have already reviewed and acked; the fifth is in mlx4_ib and has
had no comments to fix so far, after some months on the list.
Some examples of that usage:
------------------------------

http://lxr.free-electrons.com/source/kernel/audit.c#L532
	/* wait for parent to finish and send an ACK */
	mutex_lock(&audit_cmd_mutex);
	mutex_unlock(&audit_cmd_mutex);

http://lxr.free-electrons.com/source/fs/configfs/dir.c#L1386
	/* Wait until the racing operation terminates */
	mutex_lock(wait_mutex);
	mutex_unlock(wait_mutex);

http://lxr.free-electrons.com/source/drivers/infiniband/core/cma.c#L1053
	/* Wait for any active callback to finish.  New callbacks will find
	 * the id_priv state set to destroying and abort.
	 */
	mutex_lock(&id_priv->handler_mutex);
	mutex_unlock(&id_priv->handler_mutex);

http://lxr.free-electrons.com/source/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c#L96
	/* barrier */
	mutex_lock(&sec_gc_mutex);
	mutex_unlock(&sec_gc_mutex);
On 8/5/2015 3:23 AM, Jason Gunthorpe wrote:
> On Tue, Aug 04, 2015 at 05:03:28PM +0300, Yishai Hadas wrote:
>> Currently, the IB/cma remove_one flow blocks until all user descriptors
>> managed by IB/ucma are released. This prevents hot-removal of IB devices.
>> This patch allows IB/cma to remove devices regardless of user space
>> activity. Upon getting the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all
>> the underlying HW resources for the given ucontext. The ucontext itself
>> stays alive until it is explicitly destroyed by its creator.
>
> Implementation aside, this changes the policy of the ucma from
>   "Tell user space and expect it to synchronously clean up"
> to
>   "Tell user space we already nuked the RDMA device, asynchronously."
>
> Do we even want to do that unconditionally?

Yes. The kernel should not depend on userspace applications' approval
when it hits a fatal error or a device is removed/unbound.

> Shouldn't the kernel at least give userspace some time to respond to
> the event before nuking the world?

No. Kernel activity is asynchronous to user space, has higher priority,
and should not wait for it. The kernel raises an event
(RDMA_CM_EVENT_DEVICE_REMOVAL) to let userspace know that the device was
removed, and continues. An application that handles events (as it is
expected to do) should get the event and take the relevant actions.
Further application calls will return an error, but the application will
stay alive.
On Wed, Aug 05, 2015 at 06:09:45PM +0300, Yishai Hadas wrote:
> >> +static void ucma_close_id(struct work_struct *work)
> >> +{
> >> +	struct ucma_context *ctx = container_of(work, struct ucma_context, close_work);
> >> +
> >> +	/* Fence to ensure that ctx->closing was seen by all
> >> +	 * ucma_get_ctx running
> >> +	 */
> >> +	mutex_lock(&mut);
> >> +	mutex_unlock(&mut);
> >
> > Uh, no. That isn't how to use mutexes.
>
> That locking scheme is used in some other places in the kernel when a
> barrier/fence is needed; see below for some examples of that usage and my
> comments around it.

'I saw it used someplace else' isn't much of an argument... Maybe those
places are sane; this isn't so much.

> If you still don't agree, we can merge the first five patches, which
> don't depend on this last one, and handle/discuss this patch later.

I have no real problem with that. It would be nice to have an answer to
the uverbs vs. ucma removal ordering question, and to the basic issue of
whether we even want to do this so asynchronously, before merging the
driver patch that enables it.

> The first four you have already reviewed and acked; the fifth is in
> mlx4_ib and has had no comments to fix so far, after some months on the
> list.

Personally, I'm not looking much at drivers.

Jason
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Jason Gunthorpe
>
> I have no real problem with that. It would be nice to have an answer to
> the uverbs vs. ucma removal ordering question, and to the basic issue of
> whether we even want to do this so asynchronously, before merging the
> driver patch that enables it.

I don't think that the order matters. When you do a surprise removal, you
disconnect the application from both its ucma and uverbs device
references. In this state, the only thing the application can do is close
all handles. It doesn't matter whether you succeed in closing a resource
on a device that is still accessible, or "close" an already destroyed
resource...

The primary reason for surprise device removal is to recover from device
fatal errors or PCI errors. Quick, asynchronous device recovery is
exactly what we need, as there is nothing more the application can do at
this point. This is the most pressing issue that this patch set solves.

For the administrative-removal use case, we can consider giving
applications a grace period before disconnecting them, in a future patch
set. Even here, however, I am not sure that if an administrator decided
to remove a device while there are still active applications, we should
stall the request.

Anyway, let's get this patch set done.

> > The first four you have already reviewed and acked; the fifth is in
> > mlx4_ib and has had no comments to fix so far, after some months on
> > the list.
>
> Personally, I'm not looking much at drivers.
>
> Jason
On Thu, Aug 06, 2015 at 03:36:39PM +0000, Liran Liss wrote:
> I don't think that the order matters.

It does if you want a planned, 'gentle' removal to be possible.

> When you do a surprise removal, you disconnect the application from
> both its ucma and uverbs device references. In this state, the only
> thing the application can do is close all handles. It doesn't matter
> whether you succeed in closing a resource on a device that is still
> accessible, or "close" an already destroyed resource...

Hurm. That makes sense, but that isn't entirely what this patch set
does - it is rough when facing the user, but the kernel side is actually
quite gentle. E.g., the RDMA CM still sends GMPs on this shutdown path.
It looks like the various storage ULP shutdowns are pretty gentle too.
So it is easy to get confused about what the point here is.

> The primary reason for surprise device removal is to recover from
> device fatal errors or PCI errors.

Okay, this makes sense to me. Maybe some more patches will come that are
rougher on the kernel ULPs.

> For the administrative-removal use case, we can consider giving
> applications a grace period before disconnecting them, in a future
> patch set. Even here, however, I am not sure that if an administrator
> decided to remove a device while there are still active applications,
> we should stall the request.

Well, it can be discussed separately. As an example: if you want to
hot-unplug a storage controller, the admin expects a clean storage
shutdown. I think in-kernel storage drivers do that already; IB
shouldn't really be much different.

> Anyway, let's get this patch set done.

Still needs to have the locking fixed.

Jason
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
>
> It does if you want a planned, 'gentle' removal to be possible.

There could be a lot of design options for a 'gentle' removal, such as
first sending a 'prepare' event and only then running the flow proposed
here. I do not want to address this now.

> > When you do a surprise removal, you disconnect the application from
> > both its ucma and uverbs device references. In this state, the only
> > thing the application can do is close all handles. It doesn't matter
> > whether you succeed in closing a resource on a device that is still
> > accessible, or "close" an already destroyed resource...
>
> Hurm. That makes sense, but that isn't entirely what this patch set
> does - it is rough when facing the user, but the kernel side is
> actually quite gentle. E.g., the RDMA CM still sends GMPs on this
> shutdown path. It looks like the various storage ULP shutdowns are
> pretty gentle too.

Right. Kernel removal is always gentle, because kernel ULPs are trusted
to close within a finite time.

> As an example: if you want to hot-unplug a storage controller, the
> admin expects a clean storage shutdown. I think in-kernel storage
> drivers do that already; IB shouldn't really be much different.

If a storage stack has user-space components, then our future 'gentle'
closing will take care of that.

> > Anyway, let's get this patch set done.
>
> Still needs to have the locking fixed.

As pointed out
(https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg27023.html),
this "barrier" scheme is used in several places, including in the RDMA
stack. If there is no functional issue with the code, we can discuss how
to address this scheme systematically later.
--Liran
On Tue, Aug 11, 2015 at 12:47:03PM +0000, Liran Liss wrote:
> > Still needs to have the locking fixed.
>
> As pointed out
> (https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg27023.html),
> this "barrier" scheme is used in several places, including in the RDMA
> stack. If there is no functional issue with the code, we can discuss
> how to address this scheme systematically later.

That's not an argument. The closing path has nonsensical locking, and
the ugly empty mutex is part of that same mess.

Jason
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 29b2121..e9037df 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -74,6 +74,7 @@ struct ucma_file {
 	struct list_head	ctx_list;
 	struct list_head	event_list;
 	wait_queue_head_t	poll_wait;
+	struct workqueue_struct	*close_wq;
 };
 
 struct ucma_context {
@@ -89,6 +90,9 @@ struct ucma_context {
 
 	struct list_head	list;
 	struct list_head	mc_list;
+	int			closing;
+	int			destroying;
+	struct work_struct	close_work;
 };
 
 struct ucma_multicast {
@@ -107,6 +111,7 @@ struct ucma_event {
 	struct list_head	list;
 	struct rdma_cm_id	*cm_id;
 	struct rdma_ucm_event_resp resp;
+	struct work_struct	close_work;
 };
 
 static DEFINE_MUTEX(mut);
@@ -132,8 +137,12 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id)
 
 	mutex_lock(&mut);
 	ctx = _ucma_find_context(id, file);
-	if (!IS_ERR(ctx))
-		atomic_inc(&ctx->ref);
+	if (!IS_ERR(ctx)) {
+		if (ctx->closing)
+			ctx = ERR_PTR(-EIO);
+		else
+			atomic_inc(&ctx->ref);
+	}
 	mutex_unlock(&mut);
 	return ctx;
 }
@@ -144,6 +153,34 @@ static void ucma_put_ctx(struct ucma_context *ctx)
 		complete(&ctx->comp);
 }
 
+static void ucma_close_event_id(struct work_struct *work)
+{
+	struct ucma_event *uevent_close = container_of(work, struct ucma_event, close_work);
+
+	rdma_destroy_id(uevent_close->cm_id);
+	kfree(uevent_close);
+}
+
+static void ucma_close_id(struct work_struct *work)
+{
+	struct ucma_context *ctx = container_of(work, struct ucma_context, close_work);
+
+	/* Fence to ensure that ctx->closing was seen by all
+	 * ucma_get_ctx running
+	 */
+	mutex_lock(&mut);
+	mutex_unlock(&mut);
+
+	/* once all inflight tasks are finished, we close all underlying
+	 * resources. The context is still alive till its explicit destroying
+	 * by its creator.
+	 */
+	ucma_put_ctx(ctx);
+	wait_for_completion(&ctx->comp);
+	/* No new events will be generated after destroying the id. */
+	rdma_destroy_id(ctx->cm_id);
+}
+
 static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 {
 	struct ucma_context *ctx;
@@ -152,6 +189,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 	if (!ctx)
 		return NULL;
 
+	INIT_WORK(&ctx->close_work, ucma_close_id);
 	atomic_set(&ctx->ref, 1);
 	init_completion(&ctx->comp);
 	INIT_LIST_HEAD(&ctx->mc_list);
@@ -242,6 +280,42 @@ static void ucma_set_event_context(struct ucma_context *ctx,
 	}
 }
 
+/* Called with file->mut locked for the relevant context. */
+static void ucma_removal_event_handler(struct rdma_cm_id *cm_id)
+{
+	struct ucma_context *ctx = cm_id->context;
+	struct ucma_event *con_req_eve;
+	int event_found = 0;
+
+	if (ctx->destroying)
+		return;
+
+	/* only if context is pointing to cm_id that it owns it and can be
+	 * queued to be closed, otherwise that cm_id is an inflight one that
+	 * is part of that context event list pending to be detached and
+	 * reattached to its new context as part of ucma_get_event,
+	 * handled separately below.
+	 */
+	if (ctx->cm_id == cm_id) {
+		ctx->closing = 1;
+		queue_work(ctx->file->close_wq, &ctx->close_work);
+		return;
+	}
+
+	list_for_each_entry(con_req_eve, &ctx->file->event_list, list) {
+		if (con_req_eve->cm_id == cm_id &&
+		    con_req_eve->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+			list_del(&con_req_eve->list);
+			INIT_WORK(&con_req_eve->close_work, ucma_close_event_id);
+			queue_work(ctx->file->close_wq, &con_req_eve->close_work);
+			event_found = 1;
+			break;
+		}
+	}
+	if (!event_found)
+		printk(KERN_ERR "ucma_removal_event_handler: warning: connect request event wasn't found\n");
+}
+
 static int ucma_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event)
 {
@@ -276,14 +350,21 @@ static int ucma_event_handler(struct rdma_cm_id *cm_id,
 		 * We ignore events for new connections until userspace has set
 		 * their context. This can only happen if an error occurs on a
 		 * new connection before the user accepts it. This is okay,
-		 * since the accept will just fail later.
+		 * since the accept will just fail later. However, we do need
+		 * to release the underlying HW resources in case of a device
+		 * removal event.
 		 */
+		if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+			ucma_removal_event_handler(cm_id);
+
 		kfree(uevent);
 		goto out;
 	}
 
 	list_add_tail(&uevent->list, &ctx->file->event_list);
 	wake_up_interruptible(&ctx->file->poll_wait);
+	if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+		ucma_removal_event_handler(cm_id);
 out:
 	mutex_unlock(&ctx->file->mut);
 	return ret;
@@ -442,9 +523,15 @@ static void ucma_cleanup_mc_events(struct ucma_multicast *mc)
 }
 
 /*
- * We cannot hold file->mut when calling rdma_destroy_id() or we can
- * deadlock.  We also acquire file->mut in ucma_event_handler(), and
- * rdma_destroy_id() will wait until all callbacks have completed.
+ * ucma_free_ctx is called after the underlying rdma CM-ID is destroyed. At
+ * this point, no new events will be reported from the hardware. However, we
+ * still need to cleanup the UCMA context for this ID. Specifically, there
+ * might be events that have not yet been consumed by the user space software.
+ * These might include pending connect requests which we have not completed
+ * processing. We cannot call rdma_destroy_id while holding the lock of the
+ * context (file->mut), as it might cause a deadlock. We therefore extract all
+ * relevant events from the context pending events list while holding the
+ * mutex. After that we release them as needed.
  */
 static int ucma_free_ctx(struct ucma_context *ctx)
 {
@@ -452,8 +539,6 @@ static int ucma_free_ctx(struct ucma_context *ctx)
 	struct ucma_event *uevent, *tmp;
 	LIST_HEAD(list);
 
-	/* No new events will be generated after destroying the id. */
-	rdma_destroy_id(ctx->cm_id);
 
 	ucma_cleanup_multicast(ctx);
@@ -501,10 +586,20 @@ static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ucma_put_ctx(ctx);
-	wait_for_completion(&ctx->comp);
-	resp.events_reported = ucma_free_ctx(ctx);
+	mutex_lock(&ctx->file->mut);
+	ctx->destroying = 1;
+	mutex_unlock(&ctx->file->mut);
+
+	flush_workqueue(ctx->file->close_wq);
+	/* At this point it's guaranteed that there is no inflight
+	 * closing task */
+	if (!ctx->closing) {
+		ucma_put_ctx(ctx);
+		wait_for_completion(&ctx->comp);
+		rdma_destroy_id(ctx->cm_id);
+	}
+
+	resp.events_reported = ucma_free_ctx(ctx);
 	if (copy_to_user((void __user *)(unsigned long)cmd.response,
 			 &resp, sizeof(resp)))
 		ret = -EFAULT;
@@ -1529,6 +1624,7 @@ static int ucma_open(struct inode *inode, struct file *filp)
 	INIT_LIST_HEAD(&file->ctx_list);
 	init_waitqueue_head(&file->poll_wait);
 	mutex_init(&file->mut);
+	file->close_wq = create_singlethread_workqueue("ucma_close_id");
 	filp->private_data = file;
 	file->filp = filp;
@@ -1543,16 +1639,28 @@ static int ucma_close(struct inode *inode, struct file *filp)
 
 	mutex_lock(&file->mut);
 	list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
+		ctx->destroying = 1;
 		mutex_unlock(&file->mut);
 
 		mutex_lock(&mut);
 		idr_remove(&ctx_idr, ctx->id);
 		mutex_unlock(&mut);
 
+		flush_workqueue(file->close_wq);
+		/* At that step once ctx was marked as destroying and workqueue
+		 * was flushed we are safe from any inflight handlers that
+		 * might put another closing task.
+		 */
+		if (!ctx->closing)
+			/* rdma_destroy_id ensures that no event handlers are
+			 * inflight for that id before releasing it.
+			 */
+			rdma_destroy_id(ctx->cm_id);
+
 		ucma_free_ctx(ctx);
 		mutex_lock(&file->mut);
 	}
 	mutex_unlock(&file->mut);
+	destroy_workqueue(file->close_wq);
 	kfree(file);
 	return 0;
 }