Message ID | 20250410133029.2487054-10-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New |
Series | block: unify elevator changing and fix lockdep warning |
On 4/10/25 7:00 PM, Ming Lei wrote:
>  /*
>   * Use the default elevator settings. If the chosen elevator initialization
>   * fails, fall back to the "none" elevator (no elevator).
>   */
> -void elevator_init_mq(struct request_queue *q)
> +void elevator_set_default(struct request_queue *q)
>  {
> -	struct elevator_type *e;
> -	unsigned int memflags;
> +	struct elev_change_ctx ctx = { };
>  	int err;
>
> -	WARN_ON_ONCE(blk_queue_registered(q));
> -
> -	if (unlikely(q->elevator))
> +	if (!queue_is_mq(q))
>  		return;
>
> -	e = elevator_get_default(q);
> -	if (!e)
> +	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> +	if (!q->elevator && !strcmp(ctx.name, "none"))
>  		return;
> +	err = elevator_change(q, &ctx);
> +	if (err < 0)
> +		pr_warn("\"%s\" set elevator failed %d, "
> +			"falling back to \"none\"\n", ctx.name, err);
> +}
>
If we fail to set the elevator to default (mq-deadline) while registering queue,
because nr_hw_queue update is simultaneously running then we may end up setting
the queue elevator to none and that's not correct. Isn't it?

> +void elevator_set_none(struct request_queue *q)
> +{
> +	struct elev_change_ctx ctx = {
> +		.name = "none",
> +		.uevent = 1,
> +	};
> +	int err;
>
> -	blk_mq_unfreeze_queue(q, memflags);
> +	if (!queue_is_mq(q))
> +		return;
>
> -	if (err) {
> -		pr_warn("\"%s\" elevator initialization failed, "
> -			"falling back to \"none\"\n", e->elevator_name);
> -	}
> +	if (!q->elevator)
> +		return;
>
> -	elevator_put(e);
> +	err = elevator_change(q, &ctx);
> +	if (err < 0)
> +		pr_warn("%s: set none elevator failed %d\n", __func__, err);
>  }
>
Here as well if we fail to disable/exit elevator while deleting disk
because nr_hw_queue update is simultaneously running then we may
leak elevator resource?

> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
>  	if (disk->major == BLOCK_EXT_MAJOR)
>  		blk_free_ext_minor(disk->first_minor);
>  out_exit_elevator:
> -	if (disk->queue->elevator) {
> -		mutex_lock(&disk->queue->elevator_lock);
> -		elevator_exit(disk->queue);
> -		mutex_unlock(&disk->queue->elevator_lock);
> -	}
> +	elevator_set_none(disk->queue);

Same comment as above here as well but this is in add_disk code path.

Thanks,
--Nilay
On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> >  /*
> >   * Use the default elevator settings. If the chosen elevator initialization
> >   * fails, fall back to the "none" elevator (no elevator).
> >   */
> > -void elevator_init_mq(struct request_queue *q)
> > +void elevator_set_default(struct request_queue *q)
> >  {
> > -	struct elevator_type *e;
> > -	unsigned int memflags;
> > +	struct elev_change_ctx ctx = { };
> >  	int err;
> >
> > -	WARN_ON_ONCE(blk_queue_registered(q));
> > -
> > -	if (unlikely(q->elevator))
> > +	if (!queue_is_mq(q))
> >  		return;
> >
> > -	e = elevator_get_default(q);
> > -	if (!e)
> > +	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> > +	if (!q->elevator && !strcmp(ctx.name, "none"))
> >  		return;
> > +	err = elevator_change(q, &ctx);
> > +	if (err < 0)
> > +		pr_warn("\"%s\" set elevator failed %d, "
> > +			"falling back to \"none\"\n", ctx.name, err);
> > +}
> >
> If we fail to set the elevator to default (mq-deadline) while registering queue,
> because nr_hw_queue update is simultaneously running then we may end up setting
> the queue elevator to none and that's not correct. Isn't it?

It still works with none.

I think it isn't one big deal. And if it is really one issue in future, we can
set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
default sched for us.

>
> > +void elevator_set_none(struct request_queue *q)
> > +{
> > +	struct elev_change_ctx ctx = {
> > +		.name = "none",
> > +		.uevent = 1,
> > +	};
> > +	int err;
> >
> > -	blk_mq_unfreeze_queue(q, memflags);
> > +	if (!queue_is_mq(q))
> > +		return;
> >
> > -	if (err) {
> > -		pr_warn("\"%s\" elevator initialization failed, "
> > -			"falling back to \"none\"\n", e->elevator_name);
> > -	}
> > +	if (!q->elevator)
> > +		return;
> >
> > -	elevator_put(e);
> > +	err = elevator_change(q, &ctx);
> > +	if (err < 0)
> > +		pr_warn("%s: set none elevator failed %d\n", __func__, err);
> >  }
> >
> Here as well if we fail to disable/exit elevator while deleting disk
> because nr_hw_queue update is simultaneously running then we may
> leak elevator resource?

When blk_mq_update_nr_hw_queues() observes that queue is dying, it
forces to change elevator to none, so there isn't elevator leak issue.

>
> > @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> >  	if (disk->major == BLOCK_EXT_MAJOR)
> >  		blk_free_ext_minor(disk->first_minor);
> >  out_exit_elevator:
> > -	if (disk->queue->elevator) {
> > -		mutex_lock(&disk->queue->elevator_lock);
> > -		elevator_exit(disk->queue);
> > -		mutex_unlock(&disk->queue->elevator_lock);
> > -	}
> > +	elevator_set_none(disk->queue);
> Same comment as above here as well but this is in add_disk code path.

We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
!blk_queue_registered()

Thanks,
Ming
On 4/14/25 6:52 AM, Ming Lei wrote:
> On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/10/25 7:00 PM, Ming Lei wrote:
>>>  /*
>>>   * Use the default elevator settings. If the chosen elevator initialization
>>>   * fails, fall back to the "none" elevator (no elevator).
>>>   */
>>> -void elevator_init_mq(struct request_queue *q)
>>> +void elevator_set_default(struct request_queue *q)
>>>  {
>>> -	struct elevator_type *e;
>>> -	unsigned int memflags;
>>> +	struct elev_change_ctx ctx = { };
>>>  	int err;
>>>
>>> -	WARN_ON_ONCE(blk_queue_registered(q));
>>> -
>>> -	if (unlikely(q->elevator))
>>> +	if (!queue_is_mq(q))
>>>  		return;
>>>
>>> -	e = elevator_get_default(q);
>>> -	if (!e)
>>> +	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
>>> +	if (!q->elevator && !strcmp(ctx.name, "none"))
>>>  		return;
>>> +	err = elevator_change(q, &ctx);
>>> +	if (err < 0)
>>> +		pr_warn("\"%s\" set elevator failed %d, "
>>> +			"falling back to \"none\"\n", ctx.name, err);
>>> +}
>>>
>> If we fail to set the elevator to default (mq-deadline) while registering queue,
>> because nr_hw_queue update is simultaneously running then we may end up setting
>> the queue elevator to none and that's not correct. Isn't it?
>
> It still works with none.
>
> I think it isn't one big deal. And if it is really one issue in future, we can
> set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
> default sched for us.
>
>>
>>> +void elevator_set_none(struct request_queue *q)
>>> +{
>>> +	struct elev_change_ctx ctx = {
>>> +		.name = "none",
>>> +		.uevent = 1,
>>> +	};
>>> +	int err;
>>>
>>> -	blk_mq_unfreeze_queue(q, memflags);
>>> +	if (!queue_is_mq(q))
>>> +		return;
>>>
>>> -	if (err) {
>>> -		pr_warn("\"%s\" elevator initialization failed, "
>>> -			"falling back to \"none\"\n", e->elevator_name);
>>> -	}
>>> +	if (!q->elevator)
>>> +		return;
>>>
>>> -	elevator_put(e);
>>> +	err = elevator_change(q, &ctx);
>>> +	if (err < 0)
>>> +		pr_warn("%s: set none elevator failed %d\n", __func__, err);
>>>  }
>>>
>> Here as well if we fail to disable/exit elevator while deleting disk
>> because nr_hw_queue update is simultaneously running then we may
>> leak elevator resource?
>
> When blk_mq_update_nr_hw_queues() observes that queue is dying, it
> forces to change elevator to none, so there isn't elevator leak issue.
>
Yes if we get into blk_mq_update_nr_hw_queues after dying flag is set.
But what if blk_mq_update_nr_hw_queues doesn't see dying flag and starts
running __elevator_change. However later we set dying flag from del_gendisk
and starts running elevator_set_none simultaneously on another cpu?
In this case elevator_set_none would fail to set the elevator to "none" as
blk_mq_update_nr_hw_queues is running on another cpu. Isn't it?

>>
>>> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
>>>  	if (disk->major == BLOCK_EXT_MAJOR)
>>>  		blk_free_ext_minor(disk->first_minor);
>>>  out_exit_elevator:
>>> -	if (disk->queue->elevator) {
>>> -		mutex_lock(&disk->queue->elevator_lock);
>>> -		elevator_exit(disk->queue);
>>> -		mutex_unlock(&disk->queue->elevator_lock);
>>> -	}
>>> +	elevator_set_none(disk->queue);
>> Same comment as above here as well but this is in add_disk code path.
>
> We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
> !blk_queue_registered()
>
Here as well there's a thin race window possible assuming add_disk fails
after we registered queue. Assuming nr_hw_queue update starts running
and it sees queue is registered however on another cpu add_disk fails
just after registering queue. So in this case still it might be possible
that elevator_set_none might fail to set elevator to "none" just because
nr_hw_queue update is running on another cpu. What do you think?

Thanks,
--Nilay
On Tue, Apr 15, 2025 at 06:00:47PM +0530, Nilay Shroff wrote:
>
>
> On 4/14/25 6:52 AM, Ming Lei wrote:
> > On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/10/25 7:00 PM, Ming Lei wrote:
> >>>  /*
> >>>   * Use the default elevator settings. If the chosen elevator initialization
> >>>   * fails, fall back to the "none" elevator (no elevator).
> >>>   */
> >>> -void elevator_init_mq(struct request_queue *q)
> >>> +void elevator_set_default(struct request_queue *q)
> >>>  {
> >>> -	struct elevator_type *e;
> >>> -	unsigned int memflags;
> >>> +	struct elev_change_ctx ctx = { };
> >>>  	int err;
> >>>
> >>> -	WARN_ON_ONCE(blk_queue_registered(q));
> >>> -
> >>> -	if (unlikely(q->elevator))
> >>> +	if (!queue_is_mq(q))
> >>>  		return;
> >>>
> >>> -	e = elevator_get_default(q);
> >>> -	if (!e)
> >>> +	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> >>> +	if (!q->elevator && !strcmp(ctx.name, "none"))
> >>>  		return;
> >>> +	err = elevator_change(q, &ctx);
> >>> +	if (err < 0)
> >>> +		pr_warn("\"%s\" set elevator failed %d, "
> >>> +			"falling back to \"none\"\n", ctx.name, err);
> >>> +}
> >>>
> >> If we fail to set the elevator to default (mq-deadline) while registering queue,
> >> because nr_hw_queue update is simultaneously running then we may end up setting
> >> the queue elevator to none and that's not correct. Isn't it?
> >
> > It still works with none.
> >
> > I think it isn't one big deal. And if it is really one issue in future, we can
> > set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
> > default sched for us.
> >
> >>
> >>> +void elevator_set_none(struct request_queue *q)
> >>> +{
> >>> +	struct elev_change_ctx ctx = {
> >>> +		.name = "none",
> >>> +		.uevent = 1,
> >>> +	};
> >>> +	int err;
> >>>
> >>> -	blk_mq_unfreeze_queue(q, memflags);
> >>> +	if (!queue_is_mq(q))
> >>> +		return;
> >>>
> >>> -	if (err) {
> >>> -		pr_warn("\"%s\" elevator initialization failed, "
> >>> -			"falling back to \"none\"\n", e->elevator_name);
> >>> -	}
> >>> +	if (!q->elevator)
> >>> +		return;
> >>>
> >>> -	elevator_put(e);
> >>> +	err = elevator_change(q, &ctx);
> >>> +	if (err < 0)
> >>> +		pr_warn("%s: set none elevator failed %d\n", __func__, err);
> >>>  }
> >>>
> >> Here as well if we fail to disable/exit elevator while deleting disk
> >> because nr_hw_queue update is simultaneously running then we may
> >> leak elevator resource?
> >
> > When blk_mq_update_nr_hw_queues() observes that queue is dying, it
> > forces to change elevator to none, so there isn't elevator leak issue.
> >
> Yes if we get into blk_mq_update_nr_hw_queues after dying flag is set.
> But what if blk_mq_update_nr_hw_queues doesn't see dying flag and starts
> running __elevator_change. However later we set dying flag from del_gendisk
> and starts running elevator_set_none simultaneously on another cpu?
> In this case elevator_set_none would fail to set the elevator to "none" as
> blk_mq_update_nr_hw_queues is running on another cpu. Isn't it?
>
> >>
> >>> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> >>>  	if (disk->major == BLOCK_EXT_MAJOR)
> >>>  		blk_free_ext_minor(disk->first_minor);
> >>>  out_exit_elevator:
> >>> -	if (disk->queue->elevator) {
> >>> -		mutex_lock(&disk->queue->elevator_lock);
> >>> -		elevator_exit(disk->queue);
> >>> -		mutex_unlock(&disk->queue->elevator_lock);
> >>> -	}
> >>> +	elevator_set_none(disk->queue);
> >> Same comment as above here as well but this is in add_disk code path.
> >
> > We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
> > !blk_queue_registered()
> >
> Here as well there's a thin race window possible assuming add_disk fails
> after we registered queue. Assuming nr_hw_queue update starts running
> and it sees queue is registered however on another cpu add_disk fails
> just after registering queue. So in this case still it might be possible
> that elevator_set_none might fail to set elevator to "none" just because
> nr_hw_queue update is running on another cpu. What do you think?

Yeah.

It isn't hard to solve, but just don't want to make the whole implementation
too complicated.

Another way is to prevent add_disk & del_disk from happening during
updating nr_hw_queues, and this way is reasonable too because both
blk_mq debugfs & sysfs registering depends on nr_hw_queues.

Meantime we can retry add_disk/del_disk until updating_nr_hw_queues are
finished, and one waitqueue can be added, so the wait can be:

add_disk():

	while (true) {
		srcu_read_lock();
		if (set->is_updating_nr_hw_queus) {
			srcu_read_unlock();
			goto wait;
		}
		__add_disk();
		srcu_read_unlock();
		break;
wait:
		wait_event(set->wq, !set->is_updating_nr_hw_queus);
	}

Thanks,
Ming
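The retry-and-wait idea above can be tried out in user space. The following is only a rough model under stated assumptions, not kernel code: it swaps SRCU and the kernel waitqueue for a pthread mutex and condition variable, and every name in it (struct tagset_model, model_add_disk(), and so on) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the tag set state discussed above. */
struct tagset_model {
	pthread_mutex_t lock;        /* replaces SRCU in this model */
	pthread_cond_t  wq;          /* models the proposed set->wq */
	bool            is_updating; /* models the proposed "is updating" flag */
	int             disks_added; /* counts completed add operations */
};

#define TAGSET_MODEL_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

/* Mark the start of a (modelled) nr_hw_queues update. */
void model_update_begin(struct tagset_model *set)
{
	pthread_mutex_lock(&set->lock);
	set->is_updating = true;
	pthread_mutex_unlock(&set->lock);
}

/* Mark the end of the update and wake every waiter. */
void model_update_end(struct tagset_model *set)
{
	pthread_mutex_lock(&set->lock);
	set->is_updating = false;
	pthread_cond_broadcast(&set->wq);
	pthread_mutex_unlock(&set->lock);
}

/* Non-blocking attempt: refuses to add a disk while an update is running. */
bool model_try_add_disk(struct tagset_model *set)
{
	bool added = false;

	pthread_mutex_lock(&set->lock);
	if (!set->is_updating) {
		set->disks_added++;
		added = true;
	}
	pthread_mutex_unlock(&set->lock);
	return added;
}

/* Blocking variant: sleeps until no update is running, then adds. */
void model_add_disk(struct tagset_model *set)
{
	pthread_mutex_lock(&set->lock);
	while (set->is_updating)
		pthread_cond_wait(&set->wq, &set->lock);
	assert(!set->is_updating); /* the add never overlaps an update */
	set->disks_added++;
	pthread_mutex_unlock(&set->lock);
}
```

In this model the add path never runs concurrently with an update, which is the property the proposal is after. Whether the real implementation should use SRCU plus a waitqueue, as suggested above, or some other mechanism is left open by the thread.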
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a2882751f0d2..58c50709bc14 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -869,14 +869,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		goto out_unregister_ia_ranges;
 
+	elevator_set_default(q);
 	mutex_lock(&q->elevator_lock);
-	if (q->elevator) {
-		ret = elv_register_queue(q, false);
-		if (ret) {
-			mutex_unlock(&q->elevator_lock);
-			goto out_crypto_sysfs_unregister;
-		}
-	}
 	wbt_enable_default(disk);
 	mutex_unlock(&q->elevator_lock);
 
@@ -902,8 +896,6 @@ int blk_register_queue(struct gendisk *disk)
 
 	return ret;
 
-out_crypto_sysfs_unregister:
-	blk_crypto_sysfs_unregister(disk);
 out_unregister_ia_ranges:
 	disk_unregister_independent_access_ranges(disk);
 out_debugfs_remove:
@@ -949,9 +941,11 @@ void blk_unregister_queue(struct gendisk *disk)
 		blk_mq_sysfs_unregister(disk);
 	blk_crypto_sysfs_unregister(disk);
 
-	mutex_lock(&q->elevator_lock);
-	elv_unregister_queue(q);
-	mutex_unlock(&q->elevator_lock);
+	if (q->elevator) {
+		blk_mq_quiesce_queue(q);
+		elevator_set_none(q);
+		blk_mq_unquiesce_queue(q);
+	}
 
 	mutex_lock(&q->sysfs_lock);
 	disk_unregister_independent_access_ranges(disk);
diff --git a/block/blk.h b/block/blk.h
index 922a429b5363..4626beedfdce 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -321,9 +321,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 bool blk_insert_flush(struct request *rq);
 
 int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx);
-void elevator_exit(struct request_queue *q);
-int elv_register_queue(struct request_queue *q, bool uevent);
-void elv_unregister_queue(struct request_queue *q);
+void elevator_set_default(struct request_queue *q);
+void elevator_set_none(struct request_queue *q);
 
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
 		char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index 2bc1679dcd1f..7d2a56ef0be6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -151,7 +151,7 @@ static void elevator_release(struct kobject *kobj)
 	kfree(e);
 }
 
-void elevator_exit(struct request_queue *q)
+static void elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -455,7 +455,7 @@ static const struct kobj_type elv_ktype = {
 	.release = elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q, bool uevent)
 {
 	struct elevator_queue *e = q->elevator;
 	int error;
@@ -484,7 +484,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 	return error;
 }
 
-void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -560,60 +560,56 @@ EXPORT_SYMBOL_GPL(elv_unregister);
  * For single queue devices, default to using mq-deadline. If we have multiple
  * queues or mq-deadline is not available, default to "none".
  */
-static struct elevator_type *elevator_get_default(struct request_queue *q)
+static bool use_default_elevator(struct request_queue *q)
 {
 	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
-		return NULL;
+		return false;
 
 	if (q->nr_hw_queues != 1 &&
 	    !blk_mq_is_shared_tags(q->tag_set->flags))
-		return NULL;
+		return false;
 
-	return elevator_find_get("mq-deadline");
+	return true;
 }
 
 /*
  * Use the default elevator settings. If the chosen elevator initialization
  * fails, fall back to the "none" elevator (no elevator).
  */
-void elevator_init_mq(struct request_queue *q)
+void elevator_set_default(struct request_queue *q)
 {
-	struct elevator_type *e;
-	unsigned int memflags;
+	struct elev_change_ctx ctx = { };
 	int err;
 
-	WARN_ON_ONCE(blk_queue_registered(q));
-
-	if (unlikely(q->elevator))
+	if (!queue_is_mq(q))
 		return;
 
-	e = elevator_get_default(q);
-	if (!e)
+	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
+	if (!q->elevator && !strcmp(ctx.name, "none"))
 		return;
+	err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("\"%s\" set elevator failed %d, "
+			"falling back to \"none\"\n", ctx.name, err);
+}
 
-	/*
-	 * We are called before adding disk, when there isn't any FS I/O,
-	 * so freezing queue plus canceling dispatch work is enough to
-	 * drain any dispatch activities originated from passthrough
-	 * requests, then no need to quiesce queue which may add long boot
-	 * latency, especially when lots of disks are involved.
-	 *
-	 * Disk isn't added yet, so verifying queue lock only manually.
-	 */
-	memflags = blk_mq_freeze_queue(q);
-
-	blk_mq_cancel_work_sync(q);
-
-	err = blk_mq_init_sched(q, e);
+void elevator_set_none(struct request_queue *q)
+{
+	struct elev_change_ctx ctx = {
+		.name = "none",
+		.uevent = 1,
+	};
+	int err;
 
-	blk_mq_unfreeze_queue(q, memflags);
+	if (!queue_is_mq(q))
+		return;
 
-	if (err) {
-		pr_warn("\"%s\" elevator initialization failed, "
-			"falling back to \"none\"\n", e->elevator_name);
-	}
+	if (!q->elevator)
+		return;
 
-	elevator_put(e);
+	err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("%s: set none elevator failed %d\n", __func__, err);
 }
 
 /*
@@ -718,6 +714,16 @@ static int elevator_change(struct request_queue *q,
 	}
 
 	memflags = blk_mq_freeze_queue(q);
+	/*
+	 * May be called before adding disk, when there isn't any FS I/O,
+	 * so freezing queue plus canceling dispatch work is enough to
+	 * drain any dispatch activities originated from passthrough
+	 * requests, then no need to quiesce queue which may add long boot
+	 * latency, especially when lots of disks are involved.
+	 *
+	 * Disk isn't added yet, so verifying queue lock only manually.
+	 */
+	blk_mq_cancel_work_sync(q);
 	mutex_lock(&q->elevator_lock);
 	ret = __elevator_change(q, ctx);
 	mutex_unlock(&q->elevator_lock);
diff --git a/block/genhd.c b/block/genhd.c
index f426c13edf55..d7264546a178 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -416,12 +416,6 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
 		 */
 		if (disk->fops->submit_bio || disk->fops->poll_bio)
 			return -EINVAL;
-
-		/*
-		 * Initialize the I/O scheduler code and pick a default one if
-		 * needed.
-		 */
-		elevator_init_mq(disk->queue);
 	} else {
 		if (!disk->fops->submit_bio)
 			return -EINVAL;
@@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
 out_exit_elevator:
-	if (disk->queue->elevator) {
-		mutex_lock(&disk->queue->elevator_lock);
-		elevator_exit(disk->queue);
-		mutex_unlock(&disk->queue->elevator_lock);
-	}
+	elevator_set_none(disk->queue);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_disk_fwnode);
@@ -743,14 +733,7 @@ void del_gendisk(struct gendisk *disk)
 	if (queue_is_mq(q))
 		blk_mq_cancel_work_sync(q);
 
-	blk_mq_quiesce_queue(q);
-	if (q->elevator) {
-		mutex_lock(&q->elevator_lock);
-		elevator_exit(q);
-		mutex_unlock(&q->elevator_lock);
-	}
 	rq_qos_exit(q);
-	blk_mq_unquiesce_queue(q);
 
 	/*
 	 * If the disk does not own the queue, allow using passthrough requests
elevator change is one well-defined behavior:

- tear down current elevator if it exists

- setup new elevator

It is supposed to cover any case for changing elevator, typically the
following cases:

- setup default elevator in add_disk()

- switch to none in del_disk()

- reset elevator in blk_mq_update_nr_hw_queues()

- switch elevator in sysfs `store` elevator attribute

This patch uses elevator_change() to cover all above cases:

- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases

- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been frozen, so the freeze in elevator_change() won't add extra
delay

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c | 18 ++++-------
 block/blk.h       |  5 ++--
 block/elevator.c  | 76 +++++++++++++++++++++++++----------------------
 block/genhd.c     | 19 +-----------
 4 files changed, 50 insertions(+), 68 deletions(-)
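The "tear down, then set up" contract described in the commit message can be illustrated with a small user-space model. This is a sketch under assumptions, not the kernel implementation: struct queue_model and the model_*() helpers are hypothetical names, the elevator is reduced to a string pointer, and locking, freezing and error paths are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a request queue; NULL elevator means "none". */
struct queue_model {
	const char *elevator;
	int teardowns; /* how many times an elevator was torn down */
	int setups;    /* how many times an elevator was set up */
};

/*
 * Single entry point for every switch: tear down the current elevator
 * if one exists, then set up the new one unless the target is "none".
 */
int model_elevator_change(struct queue_model *q, const char *name)
{
	assert(name != NULL);

	if (q->elevator) {
		q->teardowns++;
		q->elevator = NULL;
	}
	if (strcmp(name, "none") != 0) {
		q->setups++;
		q->elevator = name;
	}
	return 0;
}

/* Mirrors the shape of elevator_set_default(): skip the none-to-none case. */
void model_set_default(struct queue_model *q, int use_default)
{
	const char *name = use_default ? "mq-deadline" : "none";

	if (!q->elevator && strcmp(name, "none") == 0)
		return;
	model_elevator_change(q, name);
}

/* Mirrors the shape of elevator_set_none(): nothing to do without an elevator. */
void model_set_none(struct queue_model *q)
{
	if (!q->elevator)
		return;
	model_elevator_change(q, "none");
}
```

Routing both helpers through one change function is what lets the patch argue about serialization in a single place. The model says nothing about the races raised earlier in the thread, since it has no concurrency at all.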