
[V2,for-4.21,1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

Message ID 20181116112311.4117-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq: fix kobject lifetime issue

Commit Message

Ming Lei Nov. 16, 2018, 11:23 a.m. UTC
Even though .mq_kobj, ctx->kobj and q->kobj appear to share the same
lifetime from the block layer's point of view, they actually don't:
userspace may grab a reference to any of these kobjects at any time via
sysfs. Each kobject's lifetime therefore has to be independent, which
means the objects that host their own kobject (mq_kobj, ctx) have to be
allocated dynamically.

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.
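
As a minimal sketch (hypothetical 'foo' names, not code from this patch),
the canonical pattern is for the release callback to free its container,
which is only safe when that container is allocated on its own:

	#include <linux/kobject.h>
	#include <linux/slab.h>

	struct foo {
		struct kobject kobj;
		/* ... */
	};

	static void foo_release(struct kobject *kobj)
	{
		struct foo *foo = container_of(kobj, struct foo, kobj);

		/*
		 * The final kobject_put() may come long after the rest of
		 * the teardown, e.g. while a sysfs file is still held open,
		 * so 'foo' must not be embedded in a larger object that is
		 * freed earlier.
		 */
		kfree(foo);
	}

	static struct kobj_type foo_ktype = {
		.release	= foo_release,
	};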

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sysfs.c   | 41 +++++++++++++++++++++++++------------
 block/blk-mq.c         | 55 +++++++++++++++++++++++++++++++++++++++++---------
 block/blk-mq.h         |  4 ++--
 include/linux/blkdev.h |  4 ++--
 4 files changed, 78 insertions(+), 26 deletions(-)

Comments

Greg Kroah-Hartman Nov. 16, 2018, 2:05 p.m. UTC | #1
On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> @@ -456,7 +456,7 @@ struct request_queue {
>  	/*
>  	 * mq queue kobject
>  	 */
> -	struct kobject mq_kobj;
> +	struct kobject *mq_kobj;

What is this kobject even used for?  It wasn't obvious at all from this
patch.  Why is it needed if you are not using it to reference count the
larger structure here?

thanks,

greg k-h
Ming Lei Nov. 17, 2018, 2:26 a.m. UTC | #2
On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> > @@ -456,7 +456,7 @@ struct request_queue {
> >  	/*
> >  	 * mq queue kobject
> >  	 */
> > -	struct kobject mq_kobj;
> > +	struct kobject *mq_kobj;
> 
> What is this kobject even used for?  It wasn't obvious at all from this
> patch.  Why is it needed if you are not using it to reference count the
> larger structure here?

All attributes and kobjects under /sys/block/$DEV/mq are covered by this kobject
actually, and all are for exposing blk-mq specific information, but now there is
only blk-mq, and legacy io path is removed.
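
Concretely, the hierarchy this kobject anchors looks like the following
(sketch; <n> is the hw queue number, <c> a CPU id):

	/sys/block/$DEV/mq/             <- q->mq_kobj
	/sys/block/$DEV/mq/<n>/         <- hctx->kobj
	/sys/block/$DEV/mq/<n>/cpu<c>/  <- ctx->kobj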

That is why I mentioned last time that we could remove this kobject and
move everything under /sys/block/$DEV/queue; however, you thought that
might break some userspace.

If we want to backport these fixes to stable, this patch may be a bit
easier to apply.

Thanks,
Ming
Greg Kroah-Hartman Nov. 19, 2018, 10:06 a.m. UTC | #3
On Sat, Nov 17, 2018 at 10:26:38AM +0800, Ming Lei wrote:
> On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote:
> > On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> > > @@ -456,7 +456,7 @@ struct request_queue {
> > >  	/*
> > >  	 * mq queue kobject
> > >  	 */
> > > -	struct kobject mq_kobj;
> > > +	struct kobject *mq_kobj;
> > 
> > What is this kobject even used for?  It wasn't obvious at all from this
> > patch.  Why is it needed if you are not using it to reference count the
> > larger structure here?
> 
> All attributes and kobjects under /sys/block/$DEV/mq are covered by this kobject
> actually, and all are for exposing blk-mq specific information, but now there is
> only blk-mq, and legacy io path is removed.

I am sorry, but I really can not parse this sentence at all.

What Documentation/ABI/ entries are covered by this kobject?  That should
help me out more.  And what do you mean by "legacy io"?

> That is why I mentioned last time that we could remove this kobject and
> move everything under /sys/block/$DEV/queue; however, you thought that
> might break some userspace.

Who relies on these sysfs files today?

> If we want to backport these fixes to stable, this patch may be a bit
> easier to apply.

Why do you want to backport any of this to stable?

thanks,

greg k-h
Ming Lei Nov. 19, 2018, 10:20 a.m. UTC | #4
On Mon, Nov 19, 2018 at 11:06:06AM +0100, Greg Kroah-Hartman wrote:
> On Sat, Nov 17, 2018 at 10:26:38AM +0800, Ming Lei wrote:
> > On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> > > > @@ -456,7 +456,7 @@ struct request_queue {
> > > >  	/*
> > > >  	 * mq queue kobject
> > > >  	 */
> > > > -	struct kobject mq_kobj;
> > > > +	struct kobject *mq_kobj;
> > > 
> > > What is this kobject even used for?  It wasn't obvious at all from this
> > > patch.  Why is it needed if you are not using it to reference count the
> > > larger structure here?
> > 
> > All attributes and kobjects under /sys/block/$DEV/mq are covered by this kobject
> > actually, and all are for exposing blk-mq specific information, but now there is
> > only blk-mq, and legacy io path is removed.
> 
> I am sorry, but I really can not parse this sentence at all.
> 
> What Documentation/ABI/ entries are covered by this kobject?  That should
> help me out more.  And what do you mean by "legacy io"?

Since blk-mq was introduced, there have been two main IO request paths:

1) blk-mq, where IO is queued via blk_mq_make_request()

2) the legacy IO path, where IO is queued via blk_queue_bio()

The legacy IO path will be removed in 4.21, and the patches for that have
already been queued in for-4.21/block.
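
In sketch form (roughly the 4.20-era API; the exact call sites may
differ), the two paths diverge at the queue's make_request function:

	/* blk-mq: set up in blk_mq_init_allocated_queue() */
	blk_queue_make_request(q, blk_mq_make_request);

	/* legacy: set up when a legacy queue is initialized */
	blk_queue_make_request(q, blk_queue_bio);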

> 
> > That is why I mentioned last time that we could remove this kobject and
> > move everything under /sys/block/$DEV/queue; however, you thought that
> > might break some userspace.
> 
> Who relies on these sysfs files today?

I don't know; actually, that question came from you, :-(

https://marc.info/?l=linux-kernel&m=154205455332755&w=2

Even if we remove q->mq_kobj, the same kobject lifetime issue would still
exist for the kobjects embedded in 'struct blk_mq_ctx'.
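
In sketch form (pre-patch layout, simplified from blk_mq_sysfs_deinit()
and blk_mq_release()):

	/* each ctx, including ctx->kobj, lives in per-cpu storage */
	struct blk_mq_ctx *ctx = per_cpu_ptr(q->queue_ctx, cpu);

	kobject_put(&ctx->kobj);	/* release callback was an empty stub */

	/* ... later, from blk_mq_release(): */
	free_percpu(q->queue_ctx);	/* frees every ctx even if sysfs
					 * still holds a ref on ctx->kobj */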

> 
> > If we want to backport these fixes to stable, this patch may be a bit
> > easier to apply.
> 
> Why do you want to backport any of this to stable?

For the report from Guenter, backporting the 1st patch should be enough,
and it shouldn't be very hard to backport both.

I can help backport them if the patches don't apply cleanly.

Thanks,
Ming

Patch

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3d25b9c419e9..cc2fef909afc 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -15,6 +15,14 @@ 
 
 static void blk_mq_sysfs_release(struct kobject *kobj)
 {
+	kfree(kobj);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+	struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+	kfree(ctx);
 }
 
 static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -213,7 +221,7 @@  static struct kobj_type blk_mq_ktype = {
 static struct kobj_type blk_mq_ctx_ktype = {
 	.sysfs_ops	= &blk_mq_sysfs_ops,
 	.default_attrs	= default_ctx_attrs,
-	.release	= blk_mq_sysfs_release,
+	.release	= blk_mq_ctx_sysfs_release,
 };
 
 static struct kobj_type blk_mq_hw_ktype = {
@@ -245,7 +253,7 @@  static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 	if (!hctx->nr_ctx)
 		return 0;
 
-	ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
+	ret = kobject_add(&hctx->kobj, q->mq_kobj, "%u", hctx->queue_num);
 	if (ret)
 		return ret;
 
@@ -268,8 +276,8 @@  void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
 
-	kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
-	kobject_del(&q->mq_kobj);
+	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+	kobject_del(q->mq_kobj);
 	kobject_put(&dev->kobj);
 
 	q->mq_sysfs_init_done = false;
@@ -286,23 +294,30 @@  void blk_mq_sysfs_deinit(struct request_queue *q)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		ctx = per_cpu_ptr(q->queue_ctx, cpu);
+		ctx = *per_cpu_ptr(q->queue_ctx, cpu);
 		kobject_put(&ctx->kobj);
 	}
-	kobject_put(&q->mq_kobj);
+	kobject_put(q->mq_kobj);
 }
 
-void blk_mq_sysfs_init(struct request_queue *q)
+int blk_mq_sysfs_init(struct request_queue *q)
 {
 	struct blk_mq_ctx *ctx;
 	int cpu;
+	struct kobject *mq_kobj;
+
+	mq_kobj = kzalloc(sizeof(*mq_kobj), GFP_KERNEL);
+	if (!mq_kobj)
+		return -ENOMEM;
 
-	kobject_init(&q->mq_kobj, &blk_mq_ktype);
+	kobject_init(mq_kobj, &blk_mq_ktype);
 
 	for_each_possible_cpu(cpu) {
-		ctx = per_cpu_ptr(q->queue_ctx, cpu);
+		ctx = *per_cpu_ptr(q->queue_ctx, cpu);
 		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
 	}
+	q->mq_kobj = mq_kobj;
+	return 0;
 }
 
 int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
@@ -313,11 +328,11 @@  int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 	WARN_ON_ONCE(!q->kobj.parent);
 	lockdep_assert_held(&q->sysfs_lock);
 
-	ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
+	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
 		goto out;
 
-	kobject_uevent(&q->mq_kobj, KOBJ_ADD);
+	kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
@@ -334,8 +349,8 @@  int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 	while (--i >= 0)
 		blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
 
-	kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
-	kobject_del(&q->mq_kobj);
+	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+	kobject_del(q->mq_kobj);
 	kobject_put(&dev->kobj);
 	return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b823891b3ef..376c04778d33 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2299,7 +2299,7 @@  static void blk_mq_init_cpu_queues(struct request_queue *q,
 	unsigned int i, j;
 
 	for_each_possible_cpu(i) {
-		struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
+		struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_ctx, i);
 		struct blk_mq_hw_ctx *hctx;
 
 		__ctx->cpu = i;
@@ -2385,7 +2385,7 @@  static void blk_mq_map_swqueue(struct request_queue *q)
 			set->map[0].mq_map[i] = 0;
 		}
 
-		ctx = per_cpu_ptr(q->queue_ctx, i);
+		ctx = *per_cpu_ptr(q->queue_ctx, i);
 		for (j = 0; j < set->nr_maps; j++) {
 			hctx = blk_mq_map_queue_type(q, j, i);
 
@@ -2515,6 +2515,38 @@  static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	mutex_unlock(&set->tag_list_lock);
 }
 
+static void blk_mq_dealloc_queue_ctx(struct request_queue *q, bool free_ctxs)
+{
+	if (free_ctxs) {
+		int cpu;
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(q->queue_ctx, cpu));
+	}
+	free_percpu(q->queue_ctx);
+}
+
+static int blk_mq_alloc_queue_ctx(struct request_queue *q)
+{
+	struct blk_mq_ctx *ctx;
+	int cpu;
+
+	q->queue_ctx = alloc_percpu(struct blk_mq_ctx *);
+	if (!q->queue_ctx)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
+		if (!ctx)
+			goto fail;
+		*per_cpu_ptr(q->queue_ctx, cpu) = ctx;
+	}
+
+	return 0;
+ fail:
+	blk_mq_dealloc_queue_ctx(q, true);
+	return -ENOMEM;
+}
+
 /*
  * It is the actual release handler for mq, but we do it from
  * request queue's release handler for avoiding use-after-free
@@ -2541,7 +2573,7 @@  void blk_mq_release(struct request_queue *q)
 	 */
 	blk_mq_sysfs_deinit(q);
 
-	free_percpu(q->queue_ctx);
+	blk_mq_dealloc_queue_ctx(q, false);
 }
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
@@ -2722,6 +2754,8 @@  static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
+	bool sysfs_init_done = false;
+
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
 
@@ -2731,18 +2765,19 @@  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->poll_cb)
 		goto err_exit;
 
-	q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
-	if (!q->queue_ctx)
+	if (blk_mq_alloc_queue_ctx(q))
 		goto err_exit;
 
 	/* init q->mq_kobj and sw queues' kobjects */
-	blk_mq_sysfs_init(q);
+	if (blk_mq_sysfs_init(q))
+		goto err_queue_ctx;
 
+	sysfs_init_done = true;
 	q->nr_queues = nr_hw_queues(set);
 	q->queue_hw_ctx = kcalloc_node(q->nr_queues, sizeof(*(q->queue_hw_ctx)),
 						GFP_KERNEL, set->numa_node);
 	if (!q->queue_hw_ctx)
-		goto err_percpu;
+		goto err_sys_init;
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
@@ -2794,8 +2829,10 @@  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 err_hctxs:
 	kfree(q->queue_hw_ctx);
-err_percpu:
-	free_percpu(q->queue_ctx);
+err_sys_init:
+	blk_mq_sysfs_deinit(q);
+err_queue_ctx:
+	blk_mq_dealloc_queue_ctx(q, !sysfs_init_done);
 err_exit:
 	q->mq_ops = NULL;
 	return ERR_PTR(-ENOMEM);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index facb6e9ddce4..84898793c230 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -108,7 +108,7 @@  static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 /*
  * sysfs helpers
  */
-extern void blk_mq_sysfs_init(struct request_queue *q);
+extern int blk_mq_sysfs_init(struct request_queue *q);
 extern void blk_mq_sysfs_deinit(struct request_queue *q);
 extern int __blk_mq_register_dev(struct device *dev, struct request_queue *q);
 extern int blk_mq_sysfs_register(struct request_queue *q);
@@ -129,7 +129,7 @@  static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
-	return per_cpu_ptr(q->queue_ctx, cpu);
+	return *per_cpu_ptr(q->queue_ctx, cpu);
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1d185f1fc333..9e3892bd67fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -407,7 +407,7 @@  struct request_queue {
 	const struct blk_mq_ops	*mq_ops;
 
 	/* sw queues */
-	struct blk_mq_ctx __percpu	*queue_ctx;
+	struct blk_mq_ctx __percpu	**queue_ctx;
 	unsigned int		nr_queues;
 
 	unsigned int		queue_depth;
@@ -456,7 +456,7 @@  struct request_queue {
 	/*
 	 * mq queue kobject
 	 */
-	struct kobject mq_kobj;
+	struct kobject *mq_kobj;
 
 #ifdef  CONFIG_BLK_DEV_INTEGRITY
 	struct blk_integrity integrity;