diff mbox series

[04/15] block: prevent elevator switch during updating nr_hw_queues

Message ID 20250410133029.2487054-5-ming.lei@redhat.com (mailing list archive)
State New
Headers show
Series block: unify elevator changing and fix lockdep warning | expand

Commit Message

Ming Lei April 10, 2025, 1:30 p.m. UTC
updating nr_hw_queues is usually used for error handling code, when it
doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
may change, and elevator tags depends on nr_hw_queues.

Prevent elevator switch during updating nr_hw_queues by setting flag of
BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
the period. Here elevator switch code is srcu reader of nr_hw_queues,
and blk_mq_update_nr_hw_queues() is the writer.

This way avoids lot of trouble.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq.c         | 19 ++++++++++++++++++-
 block/elevator.c       | 12 +++++++++++-
 include/linux/blk-mq.h | 10 +++++++++-
 4 files changed, 39 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig April 10, 2025, 2:36 p.m. UTC | #1
On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote:
> updating nr_hw_queues is usually used for error handling code, when it

Capitalize the first word of each sentence, please.

> doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
> may change, and elevator tags depends on nr_hw_queues.

I don't think it's really updated from error handling

 - nbd does it when starting a device
 - nullb can do it through debugfs
 - xen-blkfront does it when resuming from a suspend
 - nvme does it when resetting a controller.  While error handling
   can escalate to it, it's basically probing and re-probing code

> Prevent elevator switch during updating nr_hw_queues by setting flag of
> BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
> the period. Here elevator switch code is srcu reader of nr_hw_queues,
> and blk_mq_update_nr_hw_queues() is the writer.

That being said as we generally are in a setup path I think the general
idea is fine.  No devices should be live yet at this point and thus
no udev rules changing the scheduler should run yet.

> This way avoids lot of trouble.

Can you spell that out a bit?

> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/

Are we using Closes for bug reports now?  I haven't really seen that
anywhere.

>  out_cleanup_srcu:
>  	if (set->flags & BLK_MQ_F_BLOCKING)
>  		cleanup_srcu_struct(set->srcu);
> @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  {
>  	mutex_lock(&set->tag_list_lock);
> +	/*
> +	 * Mark us in updating nr_hw_queues for preventing switching
> +	 * elevator
> +	 *
> +	 * Elevator switch code can _not_ acquire ->tag_list_lock

Please add a . at the end of sentences.  Also this should probably
be something like "Mark us as in.." but I'll leave more nitpicking
to the native speakers.

>  	struct request_queue *q = disk->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
>  
>  	/*
>  	 * If the attribute needs to load a module, do it before freezing the
> @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  
>  	elv_iosched_load_module(name);
>  
> +	idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> +
> +	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {

What provides atomicity for field modifications vs reading of set->flags?
i.e. does this need to switch using test/set_bit?

> +	struct srcu_struct	update_nr_hwq_srcu;
>  };
>  
>  /**
> @@ -681,7 +682,14 @@ enum {
>  	 */
>  	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 6,
>  
> -	BLK_MQ_F_MAX = 1 << 7,
> +	/*
> +	 * True when updating nr_hw_queues is in-progress
> +	 *
> +	 * tag_set only flag, not usable for hctx
> +	 */
> +	BLK_MQ_F_UPDATE_HW_QUEUES	= 1 << 7,
> +
> +	BLK_MQ_F_MAX = 1 << 8,

Also mixing internal state with driver provided flags is always
a bad idea.  So this should probably be a new state field in the
tag_set and not reuse flags.
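
A minimal sketch of what such a dedicated state field might look like, using
atomic bitops so the internal bit never shares a word with the driver-provided
flags; every name below is assumed purely for illustration and is not taken
from the posted patch:

#include <linux/bitops.h>
#include <linux/types.h>

enum {
	/* internal tag_set state bits, never set by drivers */
	EXAMPLE_TAG_SET_UPDATING_HW_QUEUES = 0,
};

struct example_tag_set {
	unsigned int	flags;	/* driver-provided BLK_MQ_F_* flags only */
	unsigned long	state;	/* internal state, atomic bitops only */
};

/* writer side, called with the tag_list lock held */
static void example_mark_updating(struct example_tag_set *set)
{
	set_bit(EXAMPLE_TAG_SET_UPDATING_HW_QUEUES, &set->state);
}

static void example_clear_updating(struct example_tag_set *set)
{
	clear_bit(EXAMPLE_TAG_SET_UPDATING_HW_QUEUES, &set->state);
}

/* reader side, e.g. the elevator switch path */
static bool example_updating(struct example_tag_set *set)
{
	return test_bit(EXAMPLE_TAG_SET_UPDATING_HW_QUEUES, &set->state);
}

Keeping the bit in a separate, bitop-only field addresses both points above:
the update is atomic with respect to concurrent readers, and drivers cannot
accidentally set or clear internal state through the flags they pass in.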
Nilay Shroff April 11, 2025, 7:13 p.m. UTC | #2
On 4/10/25 7:00 PM, Ming Lei wrote:
> @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  {
>  	mutex_lock(&set->tag_list_lock);
> +	/*
> +	 * Mark us in updating nr_hw_queues for preventing switching
> +	 * elevator
> +	 *
> +	 * Elevator switch code can _not_ acquire ->tag_list_lock
> +	 */
> +	set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
> +	synchronize_srcu(&set->update_nr_hwq_srcu);
> +
>  	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
> +
> +	set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;

I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to
have set->flags updated as below:

set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;

Thanks,
--Nilay
Ming Lei April 14, 2025, 12:54 a.m. UTC | #3
On Thu, Apr 10, 2025 at 04:36:22PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote:
> > updating nr_hw_queues is usually used for error handling code, when it
> 
> Capitalize the first word of each sentence, please.
> 
> > doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
> > may change, and elevator tags depends on nr_hw_queues.
> 
> I don't think it's really updated from error handling

NVMe does use it in error handling. I can drop the error-handling wording,
but the trouble doesn't change.

> 
>  - nbd does it when starting a device
>  - nullb can do it through debugfs
>  - xen-blkfront does it when resuming from a suspend
>  - nvme does it when resetting a controller.  While error handling
>    can escalate to it, it's basically probing and re-probing code

Reset is part of error handling.

> 
> > Prevent elevator switch during updating nr_hw_queues by setting flag of
> > BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
> > the period. Here elevator switch code is srcu reader of nr_hw_queues,
> > and blk_mq_update_nr_hw_queues() is the writer.
> 
> That being said as we generally are in a setup path I think the general
> idea is fine.  No devices should be live yet at this point and thus
> no udev rules changing the scheduler should run yet.
> 
> > This way avoids lot of trouble.
> 
> Can you spell that out a bit?

Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/

> 
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> 
> Are we using Closes for bug reports now?  I haven't really seen that
> anywhere.

The blktests block/039 test isn't merged yet; that patch was only posted
recently.

A kernel panic and a KASAN splat are triggered by this test.

> 
> >  out_cleanup_srcu:
> >  	if (set->flags & BLK_MQ_F_BLOCKING)
> >  		cleanup_srcu_struct(set->srcu);
> > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> >  {
> >  	mutex_lock(&set->tag_list_lock);
> > +	/*
> > +	 * Mark us in updating nr_hw_queues for preventing switching
> > +	 * elevator
> > +	 *
> > +	 * Elevator switch code can _not_ acquire ->tag_list_lock
> 
> Please add a . at the end of sentences.  Also this should probably
> be something like "Mark us as in.." but I'll leave more nitpicking
> to the native speakers.

OK.

> 
> >  	struct request_queue *q = disk->queue;
> > +	struct blk_mq_tag_set *set = q->tag_set;
> >  
> >  	/*
> >  	 * If the attribute needs to load a module, do it before freezing the
> > @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >  
> >  	elv_iosched_load_module(name);
> >  
> > +	idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > +
> > +	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
> 
> What provides atomicity for field modifications vs reading of set->flags?
> i.e. does this need to switch using test/set_bit?

WRITE is serialized via tag set lock with synchronize_srcu().

READ is covered by srcu read lock.

It is typical RCU usage, one writer vs. multiple readers.
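
A minimal sketch of that one-writer/many-readers pattern in isolation, with
every name assumed purely for illustration (this is not the patch itself):
the writer publishes the in-progress state and waits out existing readers
with synchronize_srcu(), while each reader checks the state inside an SRCU
read-side section and bails out with -EBUSY.

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/srcu.h>
#include <linux/types.h>

struct example_set {
	struct mutex		lock;		/* role of set->tag_list_lock */
	struct srcu_struct	srcu;		/* role of update_nr_hwq_srcu */
	bool			updating;	/* dedicated field, shares no word with other flags */
};

/* writer: role of blk_mq_update_nr_hw_queues() */
static void example_update_nr_hw_queues(struct example_set *set)
{
	mutex_lock(&set->lock);
	set->updating = true;
	/* wait for readers that may have seen updating == false */
	synchronize_srcu(&set->srcu);

	/* ... resize the hardware queues here ... */

	set->updating = false;
	mutex_unlock(&set->lock);
}

/* reader: role of elv_iosched_store(); must not take set->lock */
static int example_switch_elevator(struct example_set *set)
{
	int idx, ret = 0;

	idx = srcu_read_lock(&set->srcu);
	if (set->updating) {
		ret = -EBUSY;
		goto out;
	}

	/* ... freeze the queue and switch the elevator here ... */

out:
	srcu_read_unlock(&set->srcu, idx);
	return ret;
}

Because a reader only ever looks at the flag inside the SRCU read-side
section, any reader that saw the old value has finished by the time
synchronize_srcu() returns, so the writer can resize the hardware queues
without racing an in-flight elevator switch.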

> 
> > +	struct srcu_struct	update_nr_hwq_srcu;
> >  };
> >  
> >  /**
> > @@ -681,7 +682,14 @@ enum {
> >  	 */
> >  	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 6,
> >  
> > -	BLK_MQ_F_MAX = 1 << 7,
> > +	/*
> > +	 * True when updating nr_hw_queues is in-progress
> > +	 *
> > +	 * tag_set only flag, not usable for hctx
> > +	 */
> > +	BLK_MQ_F_UPDATE_HW_QUEUES	= 1 << 7,
> > +
> > +	BLK_MQ_F_MAX = 1 << 8,
> 
> Also mixing internal state with driver provided flags is always
> a bad idea.  So this should probably be a new state field in the
> tag_set and not reuse flags.
 
That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too.

thanks, 
Ming
Ming Lei April 14, 2025, 12:55 a.m. UTC | #4
On Sat, Apr 12, 2025 at 12:43:10AM +0530, Nilay Shroff wrote:
> 
> 
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> >  {
> >  	mutex_lock(&set->tag_list_lock);
> > +	/*
> > +	 * Mark us in updating nr_hw_queues for preventing switching
> > +	 * elevator
> > +	 *
> > +	 * Elevator switch code can _not_ acquire ->tag_list_lock
> > +	 */
> > +	set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
> > +	synchronize_srcu(&set->update_nr_hwq_srcu);
> > +
> >  	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
> > +
> > +	set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;
> 
> I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to
> have set->flags updated as below:
> 
> set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;

Good catch!


Thanks,
Ming
Christoph Hellwig April 14, 2025, 6:07 a.m. UTC | #5
On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote:
> > >  	elv_iosched_load_module(name);
> > >  
> > > +	idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > > +
> > > +	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
> > 
> > What provides atomicity for field modifications vs reading of set->flags?
> > i.e. does this need to switch using test/set_bit?
> 
> WRITE is serialized via tag set lock with synchronize_srcu().
> 
> READ is covered by srcu read lock.
> 
> It is typical RCU usage, one writer vs. multiple readers.

No, (S)RCU does not help you with atomicity of bitfields.

> > Also mixing internal state with driver provided flags is always
> > a bad idea.  So this should probably be a new state field in the
> > tag_set and not reuse flags.
>  
> That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too.

Yes, that should also move over to the state field.  And rnbd needs
to be fixed to not set it directly, which is a good example of why
they should not be mixed.
Ming Lei April 15, 2025, 2:03 a.m. UTC | #6
On Mon, Apr 14, 2025 at 08:07:54AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote:
> > > >  	elv_iosched_load_module(name);
> > > >  
> > > > +	idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > > > +
> > > > +	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
> > > 
> > > What provides atomicity for field modifications vs reading of set->flags?
> > > i.e. does this need to switch using test/set_bit?
> > 
> > WRITE is serialized via tag set lock with synchronize_srcu().
> > 
> > READ is covered by srcu read lock.
> > 
> > It is typical RCU usage, one writer vs. multiple readers.
> 
> No, (S)RCU does not help you with atomicity of bitfields.

OK, will change it into one state variable.


Thanks,
Ming
diff mbox series

Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c308699ded58..27f984311bb7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -180,6 +180,7 @@  static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(TAG_RR),
 	HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT),
+	HCTX_FLAG_NAME(UPDATE_HW_QUEUES),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d7a103dc258b..4b0707fb7ae3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4785,12 +4785,16 @@  int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			goto out_free_srcu;
 	}
 
+	ret = init_srcu_struct(&set->update_nr_hwq_srcu);
+	if (ret)
+		goto out_cleanup_srcu;
+
 	ret = -ENOMEM;
 	set->tags = kcalloc_node(set->nr_hw_queues,
 				 sizeof(struct blk_mq_tags *), GFP_KERNEL,
 				 set->numa_node);
 	if (!set->tags)
-		goto out_cleanup_srcu;
+		goto out_cleanup_hwq_srcu;
 
 	for (i = 0; i < set->nr_maps; i++) {
 		set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
@@ -4819,6 +4823,8 @@  int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	}
 	kfree(set->tags);
 	set->tags = NULL;
+out_cleanup_hwq_srcu:
+	cleanup_srcu_struct(&set->update_nr_hwq_srcu);
 out_cleanup_srcu:
 	if (set->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(set->srcu);
@@ -5081,7 +5087,18 @@  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	mutex_lock(&set->tag_list_lock);
+	/*
+	 * Mark us in updating nr_hw_queues for preventing switching
+	 * elevator
+	 *
+	 * Elevator switch code can _not_ acquire ->tag_list_lock
+	 */
+	set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
+	synchronize_srcu(&set->update_nr_hwq_srcu);
+
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
+
+	set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;
 	mutex_unlock(&set->tag_list_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/block/elevator.c b/block/elevator.c
index cf48613c6e62..7d7b77dd4341 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -718,9 +718,10 @@  ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 {
 	char elevator_name[ELV_NAME_MAX];
 	char *name;
-	int ret;
+	int ret, idx;
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	/*
 	 * If the attribute needs to load a module, do it before freezing the
@@ -732,6 +733,13 @@  ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 
 	elv_iosched_load_module(name);
 
+	idx = srcu_read_lock(&set->update_nr_hwq_srcu);
+
+	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
+		ret = -EBUSY;
+		goto exit;
+	}
+
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
@@ -739,6 +747,8 @@  ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 		ret = count;
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+exit:
+	srcu_read_unlock(&set->update_nr_hwq_srcu, idx);
 	return ret;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..473871c760e1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -527,6 +527,7 @@  struct blk_mq_tag_set {
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
+	struct srcu_struct	update_nr_hwq_srcu;
 };
 
 /**
@@ -681,7 +682,14 @@  enum {
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 6,
 
-	BLK_MQ_F_MAX = 1 << 7,
+	/*
+	 * True when updating nr_hw_queues is in-progress
+	 *
+	 * tag_set only flag, not usable for hctx
+	 */
+	BLK_MQ_F_UPDATE_HW_QUEUES	= 1 << 7,
+
+	BLK_MQ_F_MAX = 1 << 8,
 };
 
 #define BLK_MQ_MAX_DEPTH	(10240)