diff mbox series

[rdma-next] RDMA/restrack: Delay QP deletion till all users are gone

Message ID 9ba5a611ceac86774d3d0fda12704cecc30606f9.1618753038.git.leonro@nvidia.com (mailing list archive)
State Changes Requested
Headers show
Series [rdma-next] RDMA/restrack: Delay QP deletion till all users are gone | expand

Commit Message

Leon Romanovsky April 18, 2021, 1:37 p.m. UTC
From: Shay Drory <shayd@nvidia.com>

Currently, in case of QP, the following use-after-free is possible:

	cpu0				cpu1
	----				----
 res_get_common_dumpit()
 rdma_restrack_get()
 fill_res_qp_entry()
				ib_destroy_qp_user()
				 rdma_restrack_del()
				 qp->device->ops.destroy_qp()
  ib_query_qp()
  qp->device->ops.query_qp()
    --> use-after-free-qp

This is because rdma_restrack_del(), in case of QP, isn't waiting until
all users are gone.

Fix it by making rdma_restrack_del() wait until all users are gone for
QPs as well.

Fixes: 13ef5539def7 ("RDMA/restrack: Count references to the verbs objects")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/restrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jason Gunthorpe April 20, 2021, 12:39 p.m. UTC | #1
On Sun, Apr 18, 2021 at 04:37:35PM +0300, Leon Romanovsky wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> Currently, in case of QP, the following use-after-free is possible:
> 
> 	cpu0				cpu1
> 	----				----
>  res_get_common_dumpit()
>  rdma_restrack_get()
>  fill_res_qp_entry()
> 				ib_destroy_qp_user()
> 				 rdma_restrack_del()
> 				 qp->device->ops.destroy_qp()
>   ib_query_qp()
>   qp->device->ops.query_qp()
>     --> use-after-free-qp
> 
> This is because rdma_restrack_del(), in case of QP, isn't waiting until
> all users are gone.
> 
> Fix it by making rdma_restrack_del() wait until all users are gone for
> QPs as well.
> 
> Fixes: 13ef5539def7 ("RDMA/restrack: Count references to the verbs objects")
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/core/restrack.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
> index ffabaf327242..def0c5b0efe9 100644
> --- a/drivers/infiniband/core/restrack.c
> +++ b/drivers/infiniband/core/restrack.c
> @@ -340,7 +340,7 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
>  	rt = &dev->res[res->type];
>  
>  	old = xa_erase(&rt->xa, res->id);
> -	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
> +	if (res->type == RDMA_RESTRACK_MR)
>  		return;

Why is MR skipping this?


It also calls into the driver under its dumpit, at the very least this
needs a comment.

Jason
Leon Romanovsky April 20, 2021, 1:06 p.m. UTC | #2
On Tue, Apr 20, 2021 at 09:39:50AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 18, 2021 at 04:37:35PM +0300, Leon Romanovsky wrote:
> > From: Shay Drory <shayd@nvidia.com>
> > 
> > Currently, in case of QP, the following use-after-free is possible:
> > 
> > 	cpu0				cpu1
> > 	----				----
> >  res_get_common_dumpit()
> >  rdma_restrack_get()
> >  fill_res_qp_entry()
> > 				ib_destroy_qp_user()
> > 				 rdma_restrack_del()
> > 				 qp->device->ops.destroy_qp()
> >   ib_query_qp()
> >   qp->device->ops.query_qp()
> >     --> use-after-free-qp
> > 
> > This is because rdma_restrack_del(), in case of QP, isn't waiting until
> > all users are gone.
> > 
> > Fix it by making rdma_restrack_del() wait until all users are gone for
> > QPs as well.
> > 
> > Fixes: 13ef5539def7 ("RDMA/restrack: Count references to the verbs objects")
> > Signed-off-by: Shay Drory <shayd@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/infiniband/core/restrack.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
> > index ffabaf327242..def0c5b0efe9 100644
> > --- a/drivers/infiniband/core/restrack.c
> > +++ b/drivers/infiniband/core/restrack.c
> > @@ -340,7 +340,7 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
> >  	rt = &dev->res[res->type];
> >  
> >  	old = xa_erase(&rt->xa, res->id);
> > -	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
> > +	if (res->type == RDMA_RESTRACK_MR)
> >  		return;
> 
> Why is MR skipping this?

I don't remember the justification for RDMA_RESTRACK_MR || RDMA_RESTRACK_QP check.
My guess that it is related to the allocation flow (both are not converted) and
have internal objects to the drivers.

> 
> 
> It also calls into the driver under its dumpit, at the very least this
> needs a comment.
> 
> Jason
Jason Gunthorpe April 20, 2021, 3:25 p.m. UTC | #3
On Tue, Apr 20, 2021 at 04:06:03PM +0300, Leon Romanovsky wrote:
> On Tue, Apr 20, 2021 at 09:39:50AM -0300, Jason Gunthorpe wrote:
> > On Sun, Apr 18, 2021 at 04:37:35PM +0300, Leon Romanovsky wrote:
> > > From: Shay Drory <shayd@nvidia.com>
> > > 
> > > Currently, in case of QP, the following use-after-free is possible:
> > > 
> > > 	cpu0				cpu1
> > >  res_get_common_dumpit()
> > >  rdma_restrack_get()
> > >  fill_res_qp_entry()
> > > 				ib_destroy_qp_user()
> > > 				 rdma_restrack_del()
> > > 				 qp->device->ops.destroy_qp()
> > >   ib_query_qp()
> > >   qp->device->ops.query_qp()
> > > 
> > > This is because rdma_restrack_del(), in case of QP, isn't waiting until
> > > all users are gone.
> > > 
> > > Fix it by making rdma_restrack_del() wait until all users are gone for
> > > QPs as well.
> > > 
> > > Fixes: 13ef5539def7 ("RDMA/restrack: Count references to the verbs objects")
> > > Signed-off-by: Shay Drory <shayd@nvidia.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > >  drivers/infiniband/core/restrack.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
> > > index ffabaf327242..def0c5b0efe9 100644
> > > +++ b/drivers/infiniband/core/restrack.c
> > > @@ -340,7 +340,7 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
> > >  	rt = &dev->res[res->type];
> > >  
> > >  	old = xa_erase(&rt->xa, res->id);
> > > -	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
> > > +	if (res->type == RDMA_RESTRACK_MR)
> > >  		return;
> > 
> > Why is MR skipping this?
> 
> I don't remember the justification for RDMA_RESTRACK_MR || RDMA_RESTRACK_QP check.
> My guess that it is related to the allocation flow (both are not converted) and
> have internal objects to the drivers.

Well, I don't understand why QP has a bug but MR would not have the
same one??

Jason
Leon Romanovsky April 21, 2021, 5:03 a.m. UTC | #4
On Tue, Apr 20, 2021 at 12:25:41PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 20, 2021 at 04:06:03PM +0300, Leon Romanovsky wrote:
> > On Tue, Apr 20, 2021 at 09:39:50AM -0300, Jason Gunthorpe wrote:
> > > On Sun, Apr 18, 2021 at 04:37:35PM +0300, Leon Romanovsky wrote:
> > > > From: Shay Drory <shayd@nvidia.com>
> > > > 
> > > > Currently, in case of QP, the following use-after-free is possible:
> > > > 
> > > > 	cpu0				cpu1
> > > >  res_get_common_dumpit()
> > > >  rdma_restrack_get()
> > > >  fill_res_qp_entry()
> > > > 				ib_destroy_qp_user()
> > > > 				 rdma_restrack_del()
> > > > 				 qp->device->ops.destroy_qp()
> > > >   ib_query_qp()
> > > >   qp->device->ops.query_qp()
> > > > 
> > > > This is because rdma_restrack_del(), in case of QP, isn't waiting until
> > > > all users are gone.
> > > > 
> > > > Fix it by making rdma_restrack_del() wait until all users are gone for
> > > > QPs as well.
> > > > 
> > > > Fixes: 13ef5539def7 ("RDMA/restrack: Count references to the verbs objects")
> > > > Signed-off-by: Shay Drory <shayd@nvidia.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > >  drivers/infiniband/core/restrack.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
> > > > index ffabaf327242..def0c5b0efe9 100644
> > > > +++ b/drivers/infiniband/core/restrack.c
> > > > @@ -340,7 +340,7 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
> > > >  	rt = &dev->res[res->type];
> > > >  
> > > >  	old = xa_erase(&rt->xa, res->id);
> > > > -	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
> > > > +	if (res->type == RDMA_RESTRACK_MR)
> > > >  		return;
> > > 
> > > Why is MR skipping this?
> > 
> > I don't remember the justification for RDMA_RESTRACK_MR || RDMA_RESTRACK_QP check.
> > My guess that it is related to the allocation flow (both are not converted) and
> > have internal objects to the drivers.
> 
> Well, I don't understand why QP has a bug but MR would not have the
> same one??

I didn't understand when reviewed either, but decided to post it anyway
to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
check.

Thanks

> 
> Jason
Jason Gunthorpe April 22, 2021, 2:29 p.m. UTC | #5
On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:

> I didn't understand when reviewed either, but decided to post it anyway
> to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
> check.

I think the whole thing should look more like this and we delete the
if entirely.

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2b9ffc21cbc4ad..479b16b8f6723a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1860,6 +1860,7 @@ static void _destroy_id(struct rdma_id_private *id_priv,
 				iw_destroy_cm_id(id_priv->cm_id.iw);
 		}
 		cma_leave_mc_groups(id_priv);
+		rdma_restrack_del(&id_priv->res);
 		cma_release_dev(id_priv);
 	}
 
@@ -1873,7 +1874,6 @@ static void _destroy_id(struct rdma_id_private *id_priv,
 	kfree(id_priv->id.route.path_rec);
 
 	put_net(id_priv->id.route.addr.dev_addr.net);
-	rdma_restrack_del(&id_priv->res);
 	kfree(id_priv);
 }
 
diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
index 15493357cfef09..1808bc2533f155 100644
--- a/drivers/infiniband/core/counters.c
+++ b/drivers/infiniband/core/counters.c
@@ -176,6 +176,8 @@ static void rdma_counter_free(struct rdma_counter *counter)
 {
 	struct rdma_port_counter *port_counter;
 
+	rdma_restrack_del(&counter->res);
+
 	port_counter = &counter->device->port_data[counter->port].port_counter;
 	mutex_lock(&port_counter->lock);
 	port_counter->num_counters--;
@@ -185,7 +187,6 @@ static void rdma_counter_free(struct rdma_counter *counter)
 
 	mutex_unlock(&port_counter->lock);
 
-	rdma_restrack_del(&counter->res);
 	kfree(counter->stats);
 	kfree(counter);
 }
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 433b426729d4ce..3884db637d33ab 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -339,11 +339,15 @@ void ib_free_cq(struct ib_cq *cq)
 		WARN_ON_ONCE(1);
 	}
 
+	rdma_restrack_prepare_del(&cq->res);
 	rdma_dim_destroy(cq);
 	trace_cq_free(cq);
 	ret = cq->device->ops.destroy_cq(cq, NULL);
-	WARN_ONCE(ret, "Destroy of kernel CQ shouldn't fail");
-	rdma_restrack_del(&cq->res);
+	if (WARN_ON(ret)) {
+		rdma_restrack_abort_del(&cq->res);
+		return;
+	}
+	rdma_restrack_finish_del(&cq->res);
 	kfree(cq->wc);
 	kfree(cq);
 }
diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 94d83b665a2fe0..f57234ced93ca1 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -854,6 +854,8 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
 	struct ib_ucontext *ucontext = ufile->ucontext;
 	struct ib_device *ib_dev = ucontext->device;
 
+	rdma_restrack_prepare_del(&ucontext->res);
+
 	/*
 	 * If we are closing the FD then the user mmap VMAs must have
 	 * already been destroyed as they hold on to the filep, otherwise
@@ -868,9 +870,8 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
 	ib_rdmacg_uncharge(&ucontext->cg_obj, ib_dev,
 			   RDMACG_RESOURCE_HCA_HANDLE);
 
-	rdma_restrack_del(&ucontext->res);
-
 	ib_dev->ops.dealloc_ucontext(ucontext);
+	rdma_restrack_finish_del(&ucontext->res);
 	WARN_ON(!xa_empty(&ucontext->mmap_xa));
 	kfree(ucontext);
 
diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
index 033207882c82ce..8aa1dae40f38a7 100644
--- a/drivers/infiniband/core/restrack.c
+++ b/drivers/infiniband/core/restrack.c
@@ -315,11 +315,49 @@ int rdma_restrack_put(struct rdma_restrack_entry *res)
 }
 EXPORT_SYMBOL(rdma_restrack_put);
 
-/**
- * rdma_restrack_del() - delete object from the reource tracking database
- * @res:  resource entry
- */
-void rdma_restrack_del(struct rdma_restrack_entry *res)
+void rdma_restrack_prepare_del(struct rdma_restrack_entry *res)
+{
+	struct rdma_restrack_entry *old;
+	struct rdma_restrack_root *rt;
+	struct ib_device *dev;
+
+	if (!res->valid)
+		return;
+
+	dev = res_to_dev(res);
+	rt = &dev->res[res->type];
+
+	if (!res->no_track) {
+		old = xa_cmpxchg(&rt->xa, res->id, res, XA_ZERO_ENTRY,
+				 GFP_KERNEL);
+		WARN_ON(old != res);
+	}
+
+	/* Fence access from all concurrent threads, like netlink */
+	rdma_restrack_put(res);
+	wait_for_completion(&res->comp);
+}
+EXPORT_SYMBOL(rdma_restrack_prepare_del);
+
+void rdma_restrack_abort_del(struct rdma_restrack_entry *res)
+{
+	struct rdma_restrack_entry *old;
+	struct rdma_restrack_root *rt;
+	struct ib_device *dev;
+
+	if (!res->valid || res->no_track)
+		return;
+
+	dev = res_to_dev(res);
+	rt = &dev->res[res->type];
+
+	kref_init(&res->kref);
+	old = xa_cmpxchg(&rt->xa, res->id, XA_ZERO_ENTRY, res, GFP_KERNEL);
+	WARN_ON(old != res);
+}
+EXPORT_SYMBOL(rdma_restrack_abort_del);
+
+void rdma_restrack_finish_del(struct rdma_restrack_entry *res)
 {
 	struct rdma_restrack_entry *old;
 	struct rdma_restrack_root *rt;
@@ -332,24 +370,22 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
 		}
 		return;
 	}
-
-	if (res->no_track)
-		goto out;
+	res->valid = false;
 
 	dev = res_to_dev(res);
-	if (WARN_ON(!dev))
-		return;
-
 	rt = &dev->res[res->type];
-
 	old = xa_erase(&rt->xa, res->id);
-	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
-		return;
-	WARN_ON(old != res);
+	WARN_ON(old != NULL);
+}
+EXPORT_SYMBOL(rdma_restrack_finish_del);
 
-out:
-	res->valid = false;
-	rdma_restrack_put(res);
-	wait_for_completion(&res->comp);
+/**
+ * rdma_restrack_del() - delete object from the reource tracking database
+ * @res:  resource entry
+ */
+void rdma_restrack_del(struct rdma_restrack_entry *res)
+{
+	rdma_restrack_prepare_del(res);
+	rdma_restrack_finish_del(res);
 }
 EXPORT_SYMBOL(rdma_restrack_del);
diff --git a/drivers/infiniband/core/restrack.h b/drivers/infiniband/core/restrack.h
index 6a04fc41f73801..87a670334acabe 100644
--- a/drivers/infiniband/core/restrack.h
+++ b/drivers/infiniband/core/restrack.h
@@ -27,6 +27,9 @@ int rdma_restrack_init(struct ib_device *dev);
 void rdma_restrack_clean(struct ib_device *dev);
 void rdma_restrack_add(struct rdma_restrack_entry *res);
 void rdma_restrack_del(struct rdma_restrack_entry *res);
+void rdma_restrack_prepare_del(struct rdma_restrack_entry *res);
+void rdma_restrack_finish_del(struct rdma_restrack_entry *res);
+void rdma_restrack_abort_del(struct rdma_restrack_entry *res);
 void rdma_restrack_new(struct rdma_restrack_entry *res,
 		       enum rdma_restrack_type type);
 void rdma_restrack_set_name(struct rdma_restrack_entry *res,
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 2b0798151fb753..2e761a7eb92847 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -346,13 +346,16 @@ int ib_dealloc_pd_user(struct ib_pd *pd, struct ib_udata *udata)
 	 */
 	WARN_ON(atomic_read(&pd->usecnt));
 
+	rdma_restrack_prepare_del(&pd->res);
 	ret = pd->device->ops.dealloc_pd(pd, udata);
-	if (ret)
+	if (ret) {
+		rdma_restrack_abort_del(&pd->res);
 		return ret;
+	}
 
-	rdma_restrack_del(&pd->res);
+	rdma_restrack_finish_del(&pd->res);
 	kfree(pd);
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(ib_dealloc_pd_user);
 
@@ -1085,19 +1088,22 @@ int ib_destroy_srq_user(struct ib_srq *srq, struct ib_udata *udata)
 	if (atomic_read(&srq->usecnt))
 		return -EBUSY;
 
+	rdma_restrack_prepare_del(&srq->res);
 	ret = srq->device->ops.destroy_srq(srq, udata);
-	if (ret)
+	if (ret) {
+		rdma_restrack_abort_del(&srq->res);
 		return ret;
+	}
+	rdma_restrack_finish_del(&srq->res);
 
 	atomic_dec(&srq->pd->usecnt);
 	if (srq->srq_type == IB_SRQT_XRC)
 		atomic_dec(&srq->ext.xrc.xrcd->usecnt);
 	if (ib_srq_has_cq(srq->srq_type))
 		atomic_dec(&srq->ext.cq->usecnt);
-	rdma_restrack_del(&srq->res);
 	kfree(srq);
 
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(ib_destroy_srq_user);
 
@@ -1963,31 +1969,33 @@ int ib_destroy_qp_user(struct ib_qp *qp, struct ib_udata *udata)
 		rdma_rw_cleanup_mrs(qp);
 
 	rdma_counter_unbind_qp(qp, true);
-	rdma_restrack_del(&qp->res);
+	rdma_restrack_prepare_del(&qp->res);
 	ret = qp->device->ops.destroy_qp(qp, udata);
-	if (!ret) {
-		if (alt_path_sgid_attr)
-			rdma_put_gid_attr(alt_path_sgid_attr);
-		if (av_sgid_attr)
-			rdma_put_gid_attr(av_sgid_attr);
-		if (pd)
-			atomic_dec(&pd->usecnt);
-		if (scq)
-			atomic_dec(&scq->usecnt);
-		if (rcq)
-			atomic_dec(&rcq->usecnt);
-		if (srq)
-			atomic_dec(&srq->usecnt);
-		if (ind_tbl)
-			atomic_dec(&ind_tbl->usecnt);
-		if (sec)
-			ib_destroy_qp_security_end(sec);
-	} else {
+	if (ret) {
+		rdma_restrack_abort_del(&qp->res);
 		if (sec)
 			ib_destroy_qp_security_abort(sec);
+		return ret;
 	}
 
-	return ret;
+	rdma_restrack_finish_del(&qp->res);
+	if (alt_path_sgid_attr)
+		rdma_put_gid_attr(alt_path_sgid_attr);
+	if (av_sgid_attr)
+		rdma_put_gid_attr(av_sgid_attr);
+	if (pd)
+		atomic_dec(&pd->usecnt);
+	if (scq)
+		atomic_dec(&scq->usecnt);
+	if (rcq)
+		atomic_dec(&rcq->usecnt);
+	if (srq)
+		atomic_dec(&srq->usecnt);
+	if (ind_tbl)
+		atomic_dec(&ind_tbl->usecnt);
+	if (sec)
+		ib_destroy_qp_security_end(sec);
+	return 0;
 }
 EXPORT_SYMBOL(ib_destroy_qp_user);
 
@@ -2050,13 +2058,16 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 	if (atomic_read(&cq->usecnt))
 		return -EBUSY;
 
+	rdma_restrack_prepare_del(&cq->res);
 	ret = cq->device->ops.destroy_cq(cq, udata);
-	if (ret)
+	if (ret) {
+		rdma_restrack_abort_del(&cq->res);
 		return ret;
+	}
 
-	rdma_restrack_del(&cq->res);
+	rdma_restrack_finish_del(&cq->res);
 	kfree(cq);
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(ib_destroy_cq_user);
 
@@ -2126,16 +2137,19 @@ int ib_dereg_mr_user(struct ib_mr *mr, struct ib_udata *udata)
 	int ret;
 
 	trace_mr_dereg(mr);
-	rdma_restrack_del(&mr->res);
+	rdma_restrack_prepare_del(&mr->res);
 	ret = mr->device->ops.dereg_mr(mr, udata);
-	if (!ret) {
-		atomic_dec(&pd->usecnt);
-		if (dm)
-			atomic_dec(&dm->usecnt);
-		kfree(sig_attrs);
+	if (ret) {
+		rdma_restrack_abort_del(&mr->res);
+		return ret;
 	}
 
-	return ret;
+	rdma_restrack_finish_del(&mr->res);
+	atomic_dec(&pd->usecnt);
+	if (dm)
+		atomic_dec(&dm->usecnt);
+	kfree(sig_attrs);
+	return 0;
 }
 EXPORT_SYMBOL(ib_dereg_mr_user);
Leon Romanovsky April 25, 2021, 1:03 p.m. UTC | #6
On Thu, Apr 22, 2021 at 11:29:39AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:
> 
> > I didn't understand when reviewed either, but decided to post it anyway
> > to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
> > check.
> 
> I think the whole thing should look more like this and we delete the
> if entirely.

I have mixed feelings about this approach. Before "destroy can fail disaster",
the restrack goal was to provide the following flow:
1. create new memory object - rdma_restrack_new()
2. create new HW object - .create_XXX() callback in the driver
3. add HW object to the DB - rdma_restrack_del()
....
4. wait for any work on this HW object to complete - rdma_restrack_del()
5. safely destroy HW object - .destroy_XXX()

I really would like to stay with this flow and block any access to the
object that failed to destruct - maybe add to some zombie list.

The proposed prepare/abort/finish flow is much harder to implement correctly.
Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
but didn't restore them after .destroy_qp() failure.

Thanks

> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 2b9ffc21cbc4ad..479b16b8f6723a 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -1860,6 +1860,7 @@ static void _destroy_id(struct rdma_id_private *id_priv,
>  				iw_destroy_cm_id(id_priv->cm_id.iw);
>  		}
>  		cma_leave_mc_groups(id_priv);
> +		rdma_restrack_del(&id_priv->res);
>  		cma_release_dev(id_priv);
>  	}
>  
> @@ -1873,7 +1874,6 @@ static void _destroy_id(struct rdma_id_private *id_priv,
>  	kfree(id_priv->id.route.path_rec);
>  
>  	put_net(id_priv->id.route.addr.dev_addr.net);
> -	rdma_restrack_del(&id_priv->res);
>  	kfree(id_priv);
>  }
>  
> diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
> index 15493357cfef09..1808bc2533f155 100644
> --- a/drivers/infiniband/core/counters.c
> +++ b/drivers/infiniband/core/counters.c
> @@ -176,6 +176,8 @@ static void rdma_counter_free(struct rdma_counter *counter)
>  {
>  	struct rdma_port_counter *port_counter;
>  
> +	rdma_restrack_del(&counter->res);
> +
>  	port_counter = &counter->device->port_data[counter->port].port_counter;
>  	mutex_lock(&port_counter->lock);
>  	port_counter->num_counters--;
> @@ -185,7 +187,6 @@ static void rdma_counter_free(struct rdma_counter *counter)
>  
>  	mutex_unlock(&port_counter->lock);
>  
> -	rdma_restrack_del(&counter->res);
>  	kfree(counter->stats);
>  	kfree(counter);
>  }
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index 433b426729d4ce..3884db637d33ab 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -339,11 +339,15 @@ void ib_free_cq(struct ib_cq *cq)
>  		WARN_ON_ONCE(1);
>  	}
>  
> +	rdma_restrack_prepare_del(&cq->res);
>  	rdma_dim_destroy(cq);
>  	trace_cq_free(cq);
>  	ret = cq->device->ops.destroy_cq(cq, NULL);
> -	WARN_ONCE(ret, "Destroy of kernel CQ shouldn't fail");
> -	rdma_restrack_del(&cq->res);
> +	if (WARN_ON(ret)) {
> +		rdma_restrack_abort_del(&cq->res);
> +		return;
> +	}
> +	rdma_restrack_finish_del(&cq->res);
>  	kfree(cq->wc);
>  	kfree(cq);
>  }
> diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> index 94d83b665a2fe0..f57234ced93ca1 100644
> --- a/drivers/infiniband/core/rdma_core.c
> +++ b/drivers/infiniband/core/rdma_core.c
> @@ -854,6 +854,8 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
>  	struct ib_ucontext *ucontext = ufile->ucontext;
>  	struct ib_device *ib_dev = ucontext->device;
>  
> +	rdma_restrack_prepare_del(&ucontext->res);
> +
>  	/*
>  	 * If we are closing the FD then the user mmap VMAs must have
>  	 * already been destroyed as they hold on to the filep, otherwise
> @@ -868,9 +870,8 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
>  	ib_rdmacg_uncharge(&ucontext->cg_obj, ib_dev,
>  			   RDMACG_RESOURCE_HCA_HANDLE);
>  
> -	rdma_restrack_del(&ucontext->res);
> -
>  	ib_dev->ops.dealloc_ucontext(ucontext);
> +	rdma_restrack_finish_del(&ucontext->res);
>  	WARN_ON(!xa_empty(&ucontext->mmap_xa));
>  	kfree(ucontext);
>  
> diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
> index 033207882c82ce..8aa1dae40f38a7 100644
> --- a/drivers/infiniband/core/restrack.c
> +++ b/drivers/infiniband/core/restrack.c
> @@ -315,11 +315,49 @@ int rdma_restrack_put(struct rdma_restrack_entry *res)
>  }
>  EXPORT_SYMBOL(rdma_restrack_put);
>  
> -/**
> - * rdma_restrack_del() - delete object from the reource tracking database
> - * @res:  resource entry
> - */
> -void rdma_restrack_del(struct rdma_restrack_entry *res)
> +void rdma_restrack_prepare_del(struct rdma_restrack_entry *res)
> +{
> +	struct rdma_restrack_entry *old;
> +	struct rdma_restrack_root *rt;
> +	struct ib_device *dev;
> +
> +	if (!res->valid)
> +		return;
> +
> +	dev = res_to_dev(res);
> +	rt = &dev->res[res->type];
> +
> +	if (!res->no_track) {
> +		old = xa_cmpxchg(&rt->xa, res->id, res, XA_ZERO_ENTRY,
> +				 GFP_KERNEL);
> +		WARN_ON(old != res);
> +	}
> +
> +	/* Fence access from all concurrent threads, like netlink */
> +	rdma_restrack_put(res);
> +	wait_for_completion(&res->comp);
> +}
> +EXPORT_SYMBOL(rdma_restrack_prepare_del);
> +
> +void rdma_restrack_abort_del(struct rdma_restrack_entry *res)
> +{
> +	struct rdma_restrack_entry *old;
> +	struct rdma_restrack_root *rt;
> +	struct ib_device *dev;
> +
> +	if (!res->valid || res->no_track)
> +		return;
> +
> +	dev = res_to_dev(res);
> +	rt = &dev->res[res->type];
> +
> +	kref_init(&res->kref);
> +	old = xa_cmpxchg(&rt->xa, res->id, XA_ZERO_ENTRY, res, GFP_KERNEL);
> +	WARN_ON(old != res);
> +}
> +EXPORT_SYMBOL(rdma_restrack_abort_del);
> +
> +void rdma_restrack_finish_del(struct rdma_restrack_entry *res)
>  {
>  	struct rdma_restrack_entry *old;
>  	struct rdma_restrack_root *rt;
> @@ -332,24 +370,22 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
>  		}
>  		return;
>  	}
> -
> -	if (res->no_track)
> -		goto out;
> +	res->valid = false;
>  
>  	dev = res_to_dev(res);
> -	if (WARN_ON(!dev))
> -		return;
> -
>  	rt = &dev->res[res->type];
> -
>  	old = xa_erase(&rt->xa, res->id);
> -	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
> -		return;
> -	WARN_ON(old != res);
> +	WARN_ON(old != NULL);
> +}
> +EXPORT_SYMBOL(rdma_restrack_finish_del);
>  
> -out:
> -	res->valid = false;
> -	rdma_restrack_put(res);
> -	wait_for_completion(&res->comp);
> +/**
> + * rdma_restrack_del() - delete object from the reource tracking database
> + * @res:  resource entry
> + */
> +void rdma_restrack_del(struct rdma_restrack_entry *res)
> +{
> +	rdma_restrack_prepare_del(res);
> +	rdma_restrack_finish_del(res);
>  }
>  EXPORT_SYMBOL(rdma_restrack_del);
> diff --git a/drivers/infiniband/core/restrack.h b/drivers/infiniband/core/restrack.h
> index 6a04fc41f73801..87a670334acabe 100644
> --- a/drivers/infiniband/core/restrack.h
> +++ b/drivers/infiniband/core/restrack.h
> @@ -27,6 +27,9 @@ int rdma_restrack_init(struct ib_device *dev);
>  void rdma_restrack_clean(struct ib_device *dev);
>  void rdma_restrack_add(struct rdma_restrack_entry *res);
>  void rdma_restrack_del(struct rdma_restrack_entry *res);
> +void rdma_restrack_prepare_del(struct rdma_restrack_entry *res);
> +void rdma_restrack_finish_del(struct rdma_restrack_entry *res);
> +void rdma_restrack_abort_del(struct rdma_restrack_entry *res);
>  void rdma_restrack_new(struct rdma_restrack_entry *res,
>  		       enum rdma_restrack_type type);
>  void rdma_restrack_set_name(struct rdma_restrack_entry *res,
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 2b0798151fb753..2e761a7eb92847 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -346,13 +346,16 @@ int ib_dealloc_pd_user(struct ib_pd *pd, struct ib_udata *udata)
>  	 */
>  	WARN_ON(atomic_read(&pd->usecnt));
>  
> +	rdma_restrack_prepare_del(&pd->res);
>  	ret = pd->device->ops.dealloc_pd(pd, udata);
> -	if (ret)
> +	if (ret) {
> +		rdma_restrack_abort_del(&pd->res);
>  		return ret;
> +	}
>  
> -	rdma_restrack_del(&pd->res);
> +	rdma_restrack_finish_del(&pd->res);
>  	kfree(pd);
> -	return ret;
> +	return 0;
>  }
>  EXPORT_SYMBOL(ib_dealloc_pd_user);
>  
> @@ -1085,19 +1088,22 @@ int ib_destroy_srq_user(struct ib_srq *srq, struct ib_udata *udata)
>  	if (atomic_read(&srq->usecnt))
>  		return -EBUSY;
>  
> +	rdma_restrack_prepare_del(&srq->res);
>  	ret = srq->device->ops.destroy_srq(srq, udata);
> -	if (ret)
> +	if (ret) {
> +		rdma_restrack_abort_del(&srq->res);
>  		return ret;
> +	}
> +	rdma_restrack_finish_del(&srq->res);
>  
>  	atomic_dec(&srq->pd->usecnt);
>  	if (srq->srq_type == IB_SRQT_XRC)
>  		atomic_dec(&srq->ext.xrc.xrcd->usecnt);
>  	if (ib_srq_has_cq(srq->srq_type))
>  		atomic_dec(&srq->ext.cq->usecnt);
> -	rdma_restrack_del(&srq->res);
>  	kfree(srq);
>  
> -	return ret;
> +	return 0;
>  }
>  EXPORT_SYMBOL(ib_destroy_srq_user);
>  
> @@ -1963,31 +1969,33 @@ int ib_destroy_qp_user(struct ib_qp *qp, struct ib_udata *udata)
>  		rdma_rw_cleanup_mrs(qp);
>  
>  	rdma_counter_unbind_qp(qp, true);
> -	rdma_restrack_del(&qp->res);
> +	rdma_restrack_prepare_del(&qp->res);
>  	ret = qp->device->ops.destroy_qp(qp, udata);
> -	if (!ret) {
> -		if (alt_path_sgid_attr)
> -			rdma_put_gid_attr(alt_path_sgid_attr);
> -		if (av_sgid_attr)
> -			rdma_put_gid_attr(av_sgid_attr);
> -		if (pd)
> -			atomic_dec(&pd->usecnt);
> -		if (scq)
> -			atomic_dec(&scq->usecnt);
> -		if (rcq)
> -			atomic_dec(&rcq->usecnt);
> -		if (srq)
> -			atomic_dec(&srq->usecnt);
> -		if (ind_tbl)
> -			atomic_dec(&ind_tbl->usecnt);
> -		if (sec)
> -			ib_destroy_qp_security_end(sec);
> -	} else {
> +	if (ret) {
> +		rdma_restrack_abort_del(&qp->res);
>  		if (sec)
>  			ib_destroy_qp_security_abort(sec);
> +		return ret;
>  	}
>  
> -	return ret;
> +	rdma_restrack_finish_del(&qp->res);
> +	if (alt_path_sgid_attr)
> +		rdma_put_gid_attr(alt_path_sgid_attr);
> +	if (av_sgid_attr)
> +		rdma_put_gid_attr(av_sgid_attr);
> +	if (pd)
> +		atomic_dec(&pd->usecnt);
> +	if (scq)
> +		atomic_dec(&scq->usecnt);
> +	if (rcq)
> +		atomic_dec(&rcq->usecnt);
> +	if (srq)
> +		atomic_dec(&srq->usecnt);
> +	if (ind_tbl)
> +		atomic_dec(&ind_tbl->usecnt);
> +	if (sec)
> +		ib_destroy_qp_security_end(sec);
> +	return 0;
>  }
>  EXPORT_SYMBOL(ib_destroy_qp_user);
>  
> @@ -2050,13 +2058,16 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
>  	if (atomic_read(&cq->usecnt))
>  		return -EBUSY;
>  
> +	rdma_restrack_prepare_del(&cq->res);
>  	ret = cq->device->ops.destroy_cq(cq, udata);
> -	if (ret)
> +	if (ret) {
> +		rdma_restrack_abort_del(&cq->res);
>  		return ret;
> +	}
>  
> -	rdma_restrack_del(&cq->res);
> +	rdma_restrack_finish_del(&cq->res);
>  	kfree(cq);
> -	return ret;
> +	return 0;
>  }
>  EXPORT_SYMBOL(ib_destroy_cq_user);
>  
> @@ -2126,16 +2137,19 @@ int ib_dereg_mr_user(struct ib_mr *mr, struct ib_udata *udata)
>  	int ret;
>  
>  	trace_mr_dereg(mr);
> -	rdma_restrack_del(&mr->res);
> +	rdma_restrack_prepare_del(&mr->res);
>  	ret = mr->device->ops.dereg_mr(mr, udata);
> -	if (!ret) {
> -		atomic_dec(&pd->usecnt);
> -		if (dm)
> -			atomic_dec(&dm->usecnt);
> -		kfree(sig_attrs);
> +	if (ret) {
> +		rdma_restrack_abort_del(&mr->res);
> +		return ret;
>  	}
>  
> -	return ret;
> +	rdma_restrack_finish_del(&mr->res);
> +	atomic_dec(&pd->usecnt);
> +	if (dm)
> +		atomic_dec(&dm->usecnt);
> +	kfree(sig_attrs);
> +	return 0;
>  }
>  EXPORT_SYMBOL(ib_dereg_mr_user);
>
Jason Gunthorpe April 25, 2021, 1:08 p.m. UTC | #7
On Sun, Apr 25, 2021 at 04:03:47PM +0300, Leon Romanovsky wrote:
> On Thu, Apr 22, 2021 at 11:29:39AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:
> > 
> > > I didn't understand when reviewed either, but decided to post it anyway
> > > to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
> > > check.
> > 
> > I think the whole thing should look more like this and we delete the
> > if entirely.
> 
> I have mixed feelings about this approach. Before "destroy can fail disaster",
> the restrack goal was to provide the following flow:
> 1. create new memory object - rdma_restrack_new()
> 2. create new HW object - .create_XXX() callback in the driver
> 3. add HW object to the DB - rdma_restrack_del()
> ....
> 4. wait for any work on this HW object to complete - rdma_restrack_del()
> 5. safely destroy HW object - .destroy_XXX()
> 
> I really would like to stay with this flow and block any access to the
> object that failed to destruct - maybe add to some zombie list.

That isn't the semantic we now have for destroy.
 
> The proposed prepare/abort/finish flow is much harder to implement correctly.
> Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> but didn't restore them after .destroy_qp() failure.

I think it is a bug we call rdma_rw code in a a user path.

Jason
Leon Romanovsky April 25, 2021, 1:44 p.m. UTC | #8
On Sun, Apr 25, 2021 at 10:08:57AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 25, 2021 at 04:03:47PM +0300, Leon Romanovsky wrote:
> > On Thu, Apr 22, 2021 at 11:29:39AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:
> > > 
> > > > I didn't understand when reviewed either, but decided to post it anyway
> > > > to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
> > > > check.
> > > 
> > > I think the whole thing should look more like this and we delete the
> > > if entirely.
> > 
> > I have mixed feelings about this approach. Before "destroy can fail disaster",
> > the restrack goal was to provide the following flow:
> > 1. create new memory object - rdma_restrack_new()
> > 2. create new HW object - .create_XXX() callback in the driver
> > 3. add HW object to the DB - rdma_restrack_del()
> > ....
> > 4. wait for any work on this HW object to complete - rdma_restrack_del()
> > 5. safely destroy HW object - .destroy_XXX()
> > 
> > I really would like to stay with this flow and block any access to the
> > object that failed to destruct - maybe add to some zombie list.
> 
> That isn't the semantic we now have for destroy.

I would say that it is my mistake introduced when changed destroy to
return an error.

>  
> > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > but didn't restore them after .destroy_qp() failure.
> 
> I think it is a bug we call rdma_rw code in a a user path.

It was an example of a flow that wasn't restored properly. 
The same goes for ib_dealloc_pd_user(), release of __internal_mr.

Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
from the code.

Thanks


> 
> Jason
Jason Gunthorpe April 25, 2021, 5:22 p.m. UTC | #9
On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > but didn't restore them after .destroy_qp() failure.
> > 
> > I think it is a bug we call rdma_rw code in a a user path.
> 
> It was an example of a flow that wasn't restored properly. 
> The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> 
> Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> from the code.

Well, exactly, user flows are not allowed to do extra stuff before
calling the driver destroy

So the arrangement I gave is reasonable and make sense, it is
certainly better than the hodge podge of ordering that we have today

Jason
Leon Romanovsky April 25, 2021, 5:38 p.m. UTC | #10
On Sun, Apr 25, 2021 at 02:22:54PM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > > but didn't restore them after .destroy_qp() failure.
> > > 
> > > I think it is a bug we call rdma_rw code in a a user path.
> > 
> > It was an example of a flow that wasn't restored properly. 
> > The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> > 
> > Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> > from the code.
> 
> Well, exactly, user flows are not allowed to do extra stuff before
> calling the driver destroy
> 
> So the arrangement I gave is reasonable and make sense, it is
> certainly better than the hodge podge of ordering that we have today

I thought about simpler solution - move rdma_restrack_del() before .destroy() 
callbacks together with attempt to readd res object if destroy fails.

Thanks

> 
> Jason
Jason Gunthorpe April 26, 2021, 12:03 p.m. UTC | #11
On Sun, Apr 25, 2021 at 08:38:57PM +0300, Leon Romanovsky wrote:
> On Sun, Apr 25, 2021 at 02:22:54PM -0300, Jason Gunthorpe wrote:
> > On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > > > but didn't restore them after .destroy_qp() failure.
> > > > 
> > > > I think it is a bug we call rdma_rw code in a a user path.
> > > 
> > > It was an example of a flow that wasn't restored properly. 
> > > The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> > > 
> > > Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> > > from the code.
> > 
> > Well, exactly, user flows are not allowed to do extra stuff before
> > calling the driver destroy
> > 
> > So the arrangement I gave is reasonable and make sense, it is
> > certainly better than the hodge podge of ordering that we have today
> 
> I thought about simpler solution - move rdma_restrack_del() before .destroy() 
> callbacks together with attempt to readd res object if destroy fails.

Is isn't simpler, now add can fail and can't be recovered

Jason
Leon Romanovsky April 26, 2021, 1:08 p.m. UTC | #12
On Mon, Apr 26, 2021 at 09:03:49AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 25, 2021 at 08:38:57PM +0300, Leon Romanovsky wrote:
> > On Sun, Apr 25, 2021 at 02:22:54PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > > > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > > > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > > > > but didn't restore them after .destroy_qp() failure.
> > > > > 
> > > > > I think it is a bug we call rdma_rw code in a a user path.
> > > > 
> > > > It was an example of a flow that wasn't restored properly. 
> > > > The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> > > > 
> > > > Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> > > > from the code.
> > > 
> > > Well, exactly, user flows are not allowed to do extra stuff before
> > > calling the driver destroy
> > > 
> > > So the arrangement I gave is reasonable and make sense, it is
> > > certainly better than the hodge podge of ordering that we have today
> > 
> > I thought about simpler solution - move rdma_restrack_del() before .destroy() 
> > callbacks together with attempt to readd res object if destroy fails.
> 
> Is isn't simpler, now add can fail and can't be recovered

It is not different from failure during first call to rdma_restrack_add().
You didn't like the idea to be strict with addition of restrack, but
want to be strict in reinsert.

Thanks

> 
> Jason
Jason Gunthorpe April 26, 2021, 1:11 p.m. UTC | #13
On Mon, Apr 26, 2021 at 04:08:42PM +0300, Leon Romanovsky wrote:
> On Mon, Apr 26, 2021 at 09:03:49AM -0300, Jason Gunthorpe wrote:
> > On Sun, Apr 25, 2021 at 08:38:57PM +0300, Leon Romanovsky wrote:
> > > On Sun, Apr 25, 2021 at 02:22:54PM -0300, Jason Gunthorpe wrote:
> > > > On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > > > > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > > > > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > > > > > but didn't restore them after .destroy_qp() failure.
> > > > > > 
> > > > > > I think it is a bug we call rdma_rw code in a a user path.
> > > > > 
> > > > > It was an example of a flow that wasn't restored properly. 
> > > > > The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> > > > > 
> > > > > Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> > > > > from the code.
> > > > 
> > > > Well, exactly, user flows are not allowed to do extra stuff before
> > > > calling the driver destroy
> > > > 
> > > > So the arrangement I gave is reasonable and make sense, it is
> > > > certainly better than the hodge podge of ordering that we have today
> > > 
> > > I thought about simpler solution - move rdma_restrack_del() before .destroy() 
> > > callbacks together with attempt to readd res object if destroy fails.
> > 
> > Is isn't simpler, now add can fail and can't be recovered
> 
> It is not different from failure during first call to rdma_restrack_add().
> You didn't like the idea to be strict with addition of restrack, but
> want to be strict in reinsert.

It is ugly we couldn't fix the add side, lets not repeat that uglyness
in other places

Jason
Leon Romanovsky April 27, 2021, 4:45 a.m. UTC | #14
On Mon, Apr 26, 2021 at 10:11:07AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 26, 2021 at 04:08:42PM +0300, Leon Romanovsky wrote:
> > On Mon, Apr 26, 2021 at 09:03:49AM -0300, Jason Gunthorpe wrote:
> > > On Sun, Apr 25, 2021 at 08:38:57PM +0300, Leon Romanovsky wrote:
> > > > On Sun, Apr 25, 2021 at 02:22:54PM -0300, Jason Gunthorpe wrote:
> > > > > On Sun, Apr 25, 2021 at 04:44:55PM +0300, Leon Romanovsky wrote:
> > > > > > > > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > > > > > > > Let's take as an example ib_destroy_qp_user(), we called to rdma_rw_cleanup_mrs(),
> > > > > > > > but didn't restore them after .destroy_qp() failure.
> > > > > > > 
> > > > > > > I think it is a bug we call rdma_rw code in a a user path.
> > > > > > 
> > > > > > It was an example of a flow that wasn't restored properly. 
> > > > > > The same goes for ib_dealloc_pd_user(), release of __internal_mr.
> > > > > > 
> > > > > > Of course, these flows shouldn't fail because of being kernel flows, but it is not clear
> > > > > > from the code.
> > > > > 
> > > > > Well, exactly, user flows are not allowed to do extra stuff before
> > > > > calling the driver destroy
> > > > > 
> > > > > So the arrangement I gave is reasonable and make sense, it is
> > > > > certainly better than the hodge podge of ordering that we have today
> > > > 
> > > > I thought about simpler solution - move rdma_restrack_del() before .destroy() 
> > > > callbacks together with attempt to readd res object if destroy fails.
> > > 
> > > Is isn't simpler, now add can fail and can't be recovered
> > 
> > It is not different from failure during first call to rdma_restrack_add().
> > You didn't like the idea to be strict with addition of restrack, but
> > want to be strict in reinsert.
> 
> It is ugly we couldn't fix the add side, lets not repeat that uglyness
> in other places

Why can't we fix _add?

Thanks

> 
> Jason
Leon Romanovsky May 2, 2021, 11:28 a.m. UTC | #15
On Thu, Apr 22, 2021 at 11:29:39AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:
> 
> > I didn't understand when reviewed either, but decided to post it anyway
> > to get possible explanation for this RDMA_RESTRACK_MR || RDMA_RESTRACK_QP
> > check.
> 
> I think the whole thing should look more like this and we delete the
> if entirely.

Finally, I had a time to work on it and got immediate reminder why we
have "if ...".

> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 2b9ffc21cbc4ad..479b16b8f6723a 100644

<...>

> @@ -1963,31 +1969,33 @@ int ib_destroy_qp_user(struct ib_qp *qp, struct ib_udata *udata)
>  		rdma_rw_cleanup_mrs(qp);
>  
>  	rdma_counter_unbind_qp(qp, true);
> -	rdma_restrack_del(&qp->res);
> +	rdma_restrack_prepare_del(&qp->res);
>  	ret = qp->device->ops.destroy_qp(qp, udata);
> -	if (!ret) {
> -		if (alt_path_sgid_attr)
> -			rdma_put_gid_attr(alt_path_sgid_attr);
> -		if (av_sgid_attr)
> -			rdma_put_gid_attr(av_sgid_attr);
> -		if (pd)
> -			atomic_dec(&pd->usecnt);
> -		if (scq)
> -			atomic_dec(&scq->usecnt);
> -		if (rcq)
> -			atomic_dec(&rcq->usecnt);
> -		if (srq)
> -			atomic_dec(&srq->usecnt);
> -		if (ind_tbl)
> -			atomic_dec(&ind_tbl->usecnt);
> -		if (sec)
> -			ib_destroy_qp_security_end(sec);
> -	} else {
> +	if (ret) {
> +		rdma_restrack_abort_del(&qp->res);
>  		if (sec)
>  			ib_destroy_qp_security_abort(sec);
> +		return ret;
>  	}
>  
> -	return ret;
> +	rdma_restrack_finish_del(&qp->res);

This causes to use-after-free because "qp" is released in the driver.

[leonro@vm ~]$ mkt test
.s....[   24.350657] ==================================================================
[   24.350917] BUG: KASAN: use-after-free in rdma_restrack_finish_del+0x1a7/0x210 [ib_core]
[   24.351303] Read of size 1 at addr ffff88800c59f118 by task python3/243
[   24.351416] 
[   24.351495] CPU: 4 PID: 243 Comm: python3 Not tainted 5.12.0-rc2+ #2923
[   24.351644] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   24.351832] Call Trace:
[   24.351886]  dump_stack+0x93/0xc2
[   24.351965]  print_address_description.constprop.0+0x18/0x140
[   24.352074]  ? rdma_restrack_finish_del+0x1a7/0x210 [ib_core]
[   24.352204]  ? rdma_restrack_finish_del+0x1a7/0x210 [ib_core]
[   24.352333]  kasan_report.cold+0x7c/0xd8
[   24.352413]  ? rdma_restrack_finish_del+0x1a7/0x210 [ib_core]
[   24.352543]  rdma_restrack_finish_del+0x1a7/0x210 [ib_core]
[   24.352654]  ib_destroy_qp_user+0x297/0x540 [ib_core]
[   24.352769]  uverbs_free_qp+0x7d/0x150 [ib_uverbs]
[   24.352864]  uverbs_destroy_uobject+0x164/0x6c0 [ib_uverbs]
[   24.352949]  uobj_destroy+0x72/0xf0 [ib_uverbs]
[   24.353041]  ib_uverbs_cmd_verbs+0x19b9/0x2ee0 [ib_uverbs]
[   24.353147]  ? uverbs_free_qp+0x150/0x150 [ib_uverbs]
[   24.353246]  ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
[   24.353340]  ? uverbs_fill_udata+0x540/0x540 [ib_uverbs]
[   24.353455]  ? filemap_map_pages+0x2d9/0xf30
[   24.353552]  ? lock_acquire+0x1a9/0x6d0
[   24.353620]  ? ib_uverbs_ioctl+0x11e/0x260 [ib_uverbs]
[   24.353720]  ? __might_fault+0xba/0x160
[   24.353790]  ? lock_release+0x6c0/0x6c0
[   24.353866]  ib_uverbs_ioctl+0x169/0x260 [ib_uverbs]
[   24.353958]  ? ib_uverbs_ioctl+0x11e/0x260 [ib_uverbs]
[   24.354049]  ? ib_uverbs_cmd_verbs+0x2ee0/0x2ee0 [ib_uverbs]
[   24.354158]  __x64_sys_ioctl+0x7e6/0x1050
[   24.354217]  ? generic_block_fiemap+0x60/0x60
[   24.354305]  ? down_write_nested+0x150/0x150
[   24.354400]  ? do_user_addr_fault+0x219/0xdc0
[   24.354497]  ? lockdep_hardirqs_on_prepare+0x273/0x3e0
[   24.354584]  ? syscall_enter_from_user_mode+0x1d/0x50
[   24.354678]  do_syscall_64+0x2d/0x40
[   24.354747]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   24.354837] RIP: 0033:0x7fbfd862b5db
[   24.354911] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d b8 0c 00 f7 d8 64 89 01 48
[   24.355200] RSP: 002b:00007fff7e9fde58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   24.355336] RAX: ffffffffffffffda RBX: 00007fff7e9fdf88 RCX: 00007fbfd862b5db
[   24.355468] RDX: 00007fff7e9fdf70 RSI: 00000000c0181b01 RDI: 0000000000000005
[   24.355587] RBP: 00007fff7e9fdf50 R08: 000055d6a4f8b3f0 R09: 00007fbfd800d000
[   24.355713] R10: 00007fbfd800d0f0 R11: 0000000000000246 R12: 00007fff7e9fdf50
[   24.355836] R13: 00007fff7e9fdf2c R14: 000055d6a4e59d90 R15: 000055d6a4f8b530
[   24.355973] 

So let's finish memory allocation conversion patches before.

Thanks
diff mbox series

Patch

diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
index ffabaf327242..def0c5b0efe9 100644
--- a/drivers/infiniband/core/restrack.c
+++ b/drivers/infiniband/core/restrack.c
@@ -340,7 +340,7 @@  void rdma_restrack_del(struct rdma_restrack_entry *res)
 	rt = &dev->res[res->type];
 
 	old = xa_erase(&rt->xa, res->id);
-	if (res->type == RDMA_RESTRACK_MR || res->type == RDMA_RESTRACK_QP)
+	if (res->type == RDMA_RESTRACK_MR)
 		return;
 	WARN_ON(old != res);