Message ID | 20250410133029.2487054-5-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | block: unify elevator changing and fix lockdep warning | expand |
On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote: > updating nr_hw_queues is usually used for error handling code, when it Capitalize the first word of each sentence, please. > doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues > may change, and elevator tags depends on nr_hw_queues. I don't think it's really updated from error handling - nbd does it when starting a device - nullb can do it through debugfs - xen-blkfront does it when resuming from a suspend - nvme does it when resetting a controller. While error handling can escalate to it¸ it's basically probing and re-probing code > Prevent elevator switch during updating nr_hw_queues by setting flag of > BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during > the period. Here elevator switch code is srcu reader of nr_hw_queues, > and blk_mq_update_nr_hw_queues() is the writer. That being said as we generally are in a setup path I think the general idea is fine. No devices should be life yet at this point and thus no udev rules changing the scheduler should run yet. > This way avoids lot of trouble. Can you spell that out a bit? > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ Are we using Closes for bug reports now? I haven't really seen that anywhere. > out_cleanup_srcu: > if (set->flags & BLK_MQ_F_BLOCKING) > cleanup_srcu_struct(set->srcu); > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) > { > mutex_lock(&set->tag_list_lock); > + /* > + * Mark us in updating nr_hw_queues for preventing switching > + * elevator > > + * > + * Elevator switch code can _not_ acquire ->tag_list_lock Please add a . at the end of a sentences. Also this should probably be something like "Mark us as in.." but I'll leave more nitpicking to the native speakers. > struct request_queue *q = disk->queue; > + struct blk_mq_tag_set *set = q->tag_set; > > /* > * If the attribute needs to load a module, do it before freezing the > @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, > > elv_iosched_load_module(name); > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu); > + > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { What provides atomicity for field modifications vs reading of set->flags? i.e. does this need to switch using test/set_bit? > + struct srcu_struct update_nr_hwq_srcu; > }; > > /** > @@ -681,7 +682,14 @@ enum { > */ > BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6, > > - BLK_MQ_F_MAX = 1 << 7, > + /* > + * True when updating nr_hw_queues is in-progress > + * > + * tag_set only flag, not usable for hctx > + */ > + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7, > + > + BLK_MQ_F_MAX = 1 << 8, Also mixing internal state with driver provided flags is always a bad idea. So this should probably be a new state field in the tag_set and not reuse flags.
On 4/10/25 7:00 PM, Ming Lei wrote: > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) > { > mutex_lock(&set->tag_list_lock); > + /* > + * Mark us in updating nr_hw_queues for preventing switching > + * elevator > + * > + * Elevator switch code can _not_ acquire ->tag_list_lock > + */ > + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES; > + synchronize_srcu(&set->update_nr_hwq_srcu); > + > __blk_mq_update_nr_hw_queues(set, nr_hw_queues); > + > + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES; I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to have set->flags updated as below: set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES; Thanks, --Nilay
On Thu, Apr 10, 2025 at 04:36:22PM +0200, Christoph Hellwig wrote: > On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote: > > updating nr_hw_queues is usually used for error handling code, when it > > Capitalize the first word of each sentence, please. > > > doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues > > may change, and elevator tags depends on nr_hw_queues. > > I don't think it's really updated from error handling NVMe does use it in error handling. I can remove error handling words, but the trouble doesn't change. > > - nbd does it when starting a device > - nullb can do it through debugfs > - xen-blkfront does it when resuming from a suspend > - nvme does it when resetting a controller. While error handling > can escalate to it¸ it's basically probing and re-probing code reset is part of error handling. > > > Prevent elevator switch during updating nr_hw_queues by setting flag of > > BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during > > the period. Here elevator switch code is srcu reader of nr_hw_queues, > > and blk_mq_update_nr_hw_queues() is the writer. > > That being said as we generally are in a setup path I think the general > idea is fine. No devices should be life yet at this point and thus > no udev rules changing the scheduler should run yet. > > > This way avoids lot of trouble. > > Can you spell that out a bit? Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ > > > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ > > Are we using Closes for bug reports now? I haven't really seen that > anywhere. The blktests block/039 isn't merged yet, and the patch is posted recently. kernel panic and kasan is triggered in this test. > > > out_cleanup_srcu: > > if (set->flags & BLK_MQ_F_BLOCKING) > > cleanup_srcu_struct(set->srcu); > > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) > > { > > mutex_lock(&set->tag_list_lock); > > + /* > > + * Mark us in updating nr_hw_queues for preventing switching > > + * elevator > > > > + * > > + * Elevator switch code can _not_ acquire ->tag_list_lock > > Please add a . at the end of a sentences. Also this should probably > be something like "Mark us as in.." but I'll leave more nitpicking > to the native speakers. OK. > > > struct request_queue *q = disk->queue; > > + struct blk_mq_tag_set *set = q->tag_set; > > > > /* > > * If the attribute needs to load a module, do it before freezing the > > @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, > > > > elv_iosched_load_module(name); > > > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu); > > + > > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { > > What provides atomicity for field modifications vs reading of set->flags? > i.e. does this need to switch using test/set_bit? WRITE is serialized via tag set lock with synchronize_srcu(). READ is covered by srcu read lock. It is typical RCU usage, one writer vs. multiple writer. > > > + struct srcu_struct update_nr_hwq_srcu; > > }; > > > > /** > > @@ -681,7 +682,14 @@ enum { > > */ > > BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6, > > > > - BLK_MQ_F_MAX = 1 << 7, > > + /* > > + * True when updating nr_hw_queues is in-progress > > + * > > + * tag_set only flag, not usable for hctx > > + */ > > + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7, > > + > > + BLK_MQ_F_MAX = 1 << 8, > > Also mixing internal state with driver provided flags is always > a bad idea. So this should probably be a new state field in the > tag_set and not reuse flags. That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too. thanks, Ming
On Sat, Apr 12, 2025 at 12:43:10AM +0530, Nilay Shroff wrote: > > > On 4/10/25 7:00 PM, Ming Lei wrote: > > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) > > { > > mutex_lock(&set->tag_list_lock); > > + /* > > + * Mark us in updating nr_hw_queues for preventing switching > > + * elevator > > + * > > + * Elevator switch code can _not_ acquire ->tag_list_lock > > + */ > > + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES; > > + synchronize_srcu(&set->update_nr_hwq_srcu); > > + > > __blk_mq_update_nr_hw_queues(set, nr_hw_queues); > > + > > + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES; > > I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to > have set->flags updated as below: > > set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES; Good catch! Thanks, Ming
On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote: > > > elv_iosched_load_module(name); > > > > > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu); > > > + > > > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { > > > > What provides atomicity for field modifications vs reading of set->flags? > > i.e. does this need to switch using test/set_bit? > > WRITE is serialized via tag set lock with synchronize_srcu(). > > READ is covered by srcu read lock. > > It is typical RCU usage, one writer vs. multiple writer. No, (S)RCU does not help you with atomicy of bitfields. > > Also mixing internal state with driver provided flags is always > > a bad idea. So this should probably be a new state field in the > > tag_set and not reuse flags. > > That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too. Yes, that should also move over to the state field. Amd rnbd needs to be fixed to not set it diretly which is a good example for why they should not be mixed..
On Mon, Apr 14, 2025 at 08:07:54AM +0200, Christoph Hellwig wrote: > On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote: > > > > elv_iosched_load_module(name); > > > > > > > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu); > > > > + > > > > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { > > > > > > What provides atomicity for field modifications vs reading of set->flags? > > > i.e. does this need to switch using test/set_bit? > > > > WRITE is serialized via tag set lock with synchronize_srcu(). > > > > READ is covered by srcu read lock. > > > > It is typical RCU usage, one writer vs. multiple writer. > > No, (S)RCU does not help you with atomicy of bitfields. OK, will change it into one state variable. Thanks, Ming
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index c308699ded58..27f984311bb7 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -180,6 +180,7 @@ static const char *const hctx_flag_name[] = { HCTX_FLAG_NAME(BLOCKING), HCTX_FLAG_NAME(TAG_RR), HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT), + HCTX_FLAG_NAME(UPDATE_HW_QUEUES), }; #undef HCTX_FLAG_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index d7a103dc258b..4b0707fb7ae3 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4785,12 +4785,16 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set) goto out_free_srcu; } + ret = init_srcu_struct(&set->update_nr_hwq_srcu); + if (ret) + goto out_cleanup_srcu; + ret = -ENOMEM; set->tags = kcalloc_node(set->nr_hw_queues, sizeof(struct blk_mq_tags *), GFP_KERNEL, set->numa_node); if (!set->tags) - goto out_cleanup_srcu; + goto out_cleanup_hwq_srcu; for (i = 0; i < set->nr_maps; i++) { set->map[i].mq_map = kcalloc_node(nr_cpu_ids, @@ -4819,6 +4823,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set) } kfree(set->tags); set->tags = NULL; +out_cleanup_hwq_srcu: + cleanup_srcu_struct(&set->update_nr_hwq_srcu); out_cleanup_srcu: if (set->flags & BLK_MQ_F_BLOCKING) cleanup_srcu_struct(set->srcu); @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) { mutex_lock(&set->tag_list_lock); + /* + * Mark us in updating nr_hw_queues for preventing switching + * elevator + * + * Elevator switch code can _not_ acquire ->tag_list_lock + */ + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES; + synchronize_srcu(&set->update_nr_hwq_srcu); + __blk_mq_update_nr_hw_queues(set, nr_hw_queues); + + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES; mutex_unlock(&set->tag_list_lock); } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); diff --git a/block/elevator.c b/block/elevator.c index cf48613c6e62..7d7b77dd4341 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -718,9 +718,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, { char elevator_name[ELV_NAME_MAX]; char *name; - int ret; + int ret, idx; unsigned int memflags; struct request_queue *q = disk->queue; + struct blk_mq_tag_set *set = q->tag_set; /* * If the attribute needs to load a module, do it before freezing the @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, elv_iosched_load_module(name); + idx = srcu_read_lock(&set->update_nr_hwq_srcu); + + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { + ret = -EBUSY; + goto exit; + } + memflags = blk_mq_freeze_queue(q); mutex_lock(&q->elevator_lock); ret = elevator_change(q, name); @@ -739,6 +747,8 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, ret = count; mutex_unlock(&q->elevator_lock); blk_mq_unfreeze_queue(q, memflags); +exit: + srcu_read_unlock(&set->update_nr_hwq_srcu, idx); return ret; } diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 8eb9b3310167..473871c760e1 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -527,6 +527,7 @@ struct blk_mq_tag_set { struct mutex tag_list_lock; struct list_head tag_list; struct srcu_struct *srcu; + struct srcu_struct update_nr_hwq_srcu; }; /** @@ -681,7 +682,14 @@ enum { */ BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6, - BLK_MQ_F_MAX = 1 << 7, + /* + * True when updating nr_hw_queues is in-progress + * + * tag_set only flag, not usable for hctx + */ + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7, + + BLK_MQ_F_MAX = 1 << 8, }; #define BLK_MQ_MAX_DEPTH (10240)
updating nr_hw_queues is usually used for error handling code, when it doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues may change, and elevator tags depends on nr_hw_queues. Prevent elevator switch during updating nr_hw_queues by setting flag of BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during the period. Here elevator switch code is srcu reader of nr_hw_queues, and blk_mq_update_nr_hw_queues() is the writer. This way avoids lot of trouble. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ Signed-off-by: Ming Lei <ming.lei@redhat.com> --- block/blk-mq-debugfs.c | 1 + block/blk-mq.c | 19 ++++++++++++++++++- block/elevator.c | 12 +++++++++++- include/linux/blk-mq.h | 10 +++++++++- 4 files changed, 39 insertions(+), 3 deletions(-)