Message ID | 20250403105402.1334206-1-ming.lei@redhat.com
---|---
State | New
Series | block: don't grab elevator lock during queue initialization
On 4/3/25 4:24 PM, Ming Lei wrote: > ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. > > queue freeze lock depends on fs_reclaim. > > So don't grab elevator lock during queue initialization which needs to > call kmalloc(GFP_KERNEL), and we can cut the dependency between > ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. > > This way is safe because elevator setting isn't ready to run during > queue initialization. > > There isn't such issue in __blk_mq_update_nr_hw_queues() because > memalloc_noio_save() is called before acquiring elevator lock. > > Fixes the following lockdep warning: > > https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ > > Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com > Cc: Nilay Shroff <nilay@linux.ibm.com> > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > block/blk-mq.c | 24 +++++++++++++++++------- > 1 file changed, 17 insertions(+), 7 deletions(-) > I think you earlier posted this same patch here: https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/ I tested that earlier patch and got into another lockdep splat as reported here: https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/ I don't know why we think your earlier fix which cut dependency between ->elevator_lock and ->freeze_lock doesn't work. But anyways, my view is that we've got into these lock chains from two different code paths: path1: elevator_lock -> fs_reclaim (GFP_KERNEL) -> freeze_lock path2: freeze_lock(memalloc_noio) -> elevator_lock -> fs_reclaim (this becomes NOP in this case due to memalloc_noio) As you could see above, we've got into circular dependency between ->elevator_lock and ->freeze_lock. I thought with your earlier patch we were able to cut that dependency using blk_mq_enter_no_io and blk_mq_exit_no_io. But I'm not sure why we think that won't work? Thanks, --Nilay
On Thu, Apr 03, 2025 at 06:49:05PM +0530, Nilay Shroff wrote: > > > On 4/3/25 4:24 PM, Ming Lei wrote: > > ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. > > > > queue freeze lock depends on fs_reclaim. > > > > So don't grab elevator lock during queue initialization which needs to > > call kmalloc(GFP_KERNEL), and we can cut the dependency between > > ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. > > > > This way is safe because elevator setting isn't ready to run during > > queue initialization. > > > > There isn't such issue in __blk_mq_update_nr_hw_queues() because > > memalloc_noio_save() is called before acquiring elevator lock. > > > > Fixes the following lockdep warning: > > > > https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ > > > > Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com > > Cc: Nilay Shroff <nilay@linux.ibm.com> > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > block/blk-mq.c | 24 +++++++++++++++++------- > > 1 file changed, 17 insertions(+), 7 deletions(-) > > > I think you earlier posted this same patch here: > https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/ > > I tested that earlier patch and got into another lockdep splat as reported here: > https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/ That is another different one, let's fix this one first. The ->elevator_lock in blk_register_queue() should be for avoiding race with updating nr_hw_queues, right? That is why I think the dependency between elevator lock and freeze lock is one trouble. > > I don't know why we think your earlier fix which cut dependency between > ->elevator_lock and ->freeze_lock doesn't work. But anyways, my view > is that we've got into these lock chains from two different code paths: As I explained, blk_mq_enter_no_io() is same with freeze queue, just the lock in __bio_queue_enter() isn't modeled. If it is done, every lockdep warning will be re-triggered too. > > path1: elevator_lock > -> fs_reclaim (GFP_KERNEL) > -> freeze_lock > > path2: freeze_lock(memalloc_noio) > -> elevator_lock > -> fs_reclaim (this becomes NOP in this case due to memalloc_noio) No, there isn't fs_reclaim in path2, and memalloc_noio() will avoid it. Thanks, Ming
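To make the two orderings discussed above concrete, here is a minimal sketch of the call patterns involved (illustrative helper names, not the in-tree functions; it assumes the ->elevator_lock, blk_mq_freeze_queue()/blk_mq_unfreeze_queue() and memalloc_noio_save()/memalloc_noio_restore() interfaces referenced elsewhere in this thread):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Path 1 (queue registration / elevator sysfs): a GFP_KERNEL allocation is
 * made while ->elevator_lock is held, so lockdep records
 * elevator_lock -> fs_reclaim -> freeze_lock.
 */
static void *path1_alloc_under_elevator_lock(struct request_queue *q)
{
	void *p;

	mutex_lock(&q->elevator_lock);
	p = kmalloc(64, GFP_KERNEL);	/* may recurse into fs_reclaim */
	mutex_unlock(&q->elevator_lock);
	return p;
}

/*
 * Path 2 (__blk_mq_update_nr_hw_queues()): the task is marked NOIO before
 * the queue is frozen and ->elevator_lock is taken, so allocations under
 * the lock cannot enter fs_reclaim and the chain stops at
 * freeze_lock -> elevator_lock.
 */
static void path2_update_under_noio(struct request_queue *q)
{
	unsigned int noio_flags = memalloc_noio_save();
	unsigned int memflags = blk_mq_freeze_queue(q);

	mutex_lock(&q->elevator_lock);
	/* GFP_KERNEL allocations here behave as GFP_NOIO */
	mutex_unlock(&q->elevator_lock);

	blk_mq_unfreeze_queue(q, memflags);
	memalloc_noio_restore(noio_flags);
}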
On Thu, 03 Apr 2025 18:54:02 +0800, Ming Lei wrote: > ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. > > queue freeze lock depends on fs_reclaim. > > So don't grab elevator lock during queue initialization which needs to > call kmalloc(GFP_KERNEL), and we can cut the dependency between > ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. > > [...] Applied, thanks! [1/1] block: don't grab elevator lock during queue initialization commit: 01b91bf14f6d4893e03e357006e7af3a20c03fee Best regards,
On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > Fixes the following lockdep warning: Please spell the actual dependency out here, links are not permanent and also not readable for any offline reading of the commit logs. > +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > + struct request_queue *q, bool lock) > +{ > + if (lock) { bool lock(ed) arguments are an anti-pattern, and regularly get Linus screaming at you (in this case even for the right reason :)) > + /* protect against switching io scheduler */ > + mutex_lock(&q->elevator_lock); > + __blk_mq_realloc_hw_ctxs(set, q); > + mutex_unlock(&q->elevator_lock); > + } else { > + __blk_mq_realloc_hw_ctxs(set, q); > + } I think the problem here is again that because of all the other dependencies elevator_lock really needs to be per-set instead of per-queue which will allows us to have much saner locking hierarchies.
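As an aside, the usual way to avoid a bool locking parameter like the one criticized above is to keep the unlocked helper and add an explicit locking wrapper, so each call site states its context; a rough sketch based on the helpers in the quoted diff (not the patch that was actually applied):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* existing unlocked helper from the quoted diff */
static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
				     struct request_queue *q);

/*
 * Callers that can race with an elevator switch use the locking wrapper;
 * queue initialization, where no switch can happen yet, calls
 * __blk_mq_realloc_hw_ctxs() directly instead of passing a flag.
 */
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
				   struct request_queue *q)
{
	/* protect against switching io scheduler */
	mutex_lock(&q->elevator_lock);
	__blk_mq_realloc_hw_ctxs(set, q);
	mutex_unlock(&q->elevator_lock);
}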
On Fri, Apr 04, 2025 at 11:10:37AM +0200, Christoph Hellwig wrote: > On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > > Fixes the following lockdep warning: > > Please spell the actual dependency out here, links are not permanent > and also not readable for any offline reading of the commit logs. > > > +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > > + struct request_queue *q, bool lock) > > +{ > > + if (lock) { > > bool lock(ed) arguments are an anti-pattern, and regularly get Linus > screaming at you (in this case even for the right reason :)) > > > + /* protect against switching io scheduler */ > > + mutex_lock(&q->elevator_lock); > > + __blk_mq_realloc_hw_ctxs(set, q); > > + mutex_unlock(&q->elevator_lock); > > + } else { > > + __blk_mq_realloc_hw_ctxs(set, q); > > + } > > I think the problem here is again that because of all the other > dependencies elevator_lock really needs to be per-set instead of > per-queue which will allows us to have much saner locking hierarchies. per-set lock is required in error handling code path, anywhere it is used with freezing together can be one deadlock risk if hardware error happens. Or can you explain the idea in details? Thanks, Ming
>> I don't know why we think your earlier fix which cut dependency between >> ->elevator_lock and ->freeze_lock doesn't work. But anyways, my view >> is that we've got into these lock chains from two different code paths: > > As I explained, blk_mq_enter_no_io() is same with freeze queue, just the > lock in __bio_queue_enter() isn't modeled. If it is done, every lockdep > warning will be re-triggered too. > Oh I see, because I tested your earlier patches without lock modeled I didn't encounter any lockdep warning. >> >> path1: elevator_lock >> -> fs_reclaim (GFP_KERNEL) >> -> freeze_lock >> >> path2: freeze_lock(memalloc_noio) >> -> elevator_lock >> -> fs_reclaim (this becomes NOP in this case due to memalloc_noio) > > No, there isn't fs_reclaim in path2, and memalloc_noio() will avoid it. > Yes correct and so I mentioned above NOP for fs_reclaim in path2. Thanks, --Nilay
On 4/4/25 2:40 PM, Christoph Hellwig wrote: > On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: >> Fixes the following lockdep warning: > > Please spell the actual dependency out here, links are not permanent > and also not readable for any offline reading of the commit logs. > >> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, >> + struct request_queue *q, bool lock) >> +{ >> + if (lock) { > > bool lock(ed) arguments are an anti-pattern, and regularly get Linus > screaming at you (in this case even for the right reason :)) > >> + /* protect against switching io scheduler */ >> + mutex_lock(&q->elevator_lock); >> + __blk_mq_realloc_hw_ctxs(set, q); >> + mutex_unlock(&q->elevator_lock); >> + } else { >> + __blk_mq_realloc_hw_ctxs(set, q); >> + } > > I think the problem here is again that because of all the other > dependencies elevator_lock really needs to be per-set instead of > per-queue which will allows us to have much saner locking hierarchies. > I believe you meant here q->tag_set->elevator_lock? If yes then it means that we should be able to grab ->elevator_lock before freezing the queue in __blk_mq_update_nr_hw_queues and so locking order should be in each code path, __blk_mq_update_nr_hw_queues ->elevator_lock ->freeze_lock or blk_register_queue ->elevator_lock -> fs_reclaim (GFP_KERNEL) -> freeze_lock Other code paths using ->elevator_lock and ->freeze_lock shall be updated accordingly. Thanks, --Nilay
On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: > > > On 4/4/25 2:40 PM, Christoph Hellwig wrote: > > On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > >> Fixes the following lockdep warning: > > > > Please spell the actual dependency out here, links are not permanent > > and also not readable for any offline reading of the commit logs. > > > >> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > >> + struct request_queue *q, bool lock) > >> +{ > >> + if (lock) { > > > > bool lock(ed) arguments are an anti-pattern, and regularly get Linus > > screaming at you (in this case even for the right reason :)) > > > >> + /* protect against switching io scheduler */ > >> + mutex_lock(&q->elevator_lock); > >> + __blk_mq_realloc_hw_ctxs(set, q); > >> + mutex_unlock(&q->elevator_lock); > >> + } else { > >> + __blk_mq_realloc_hw_ctxs(set, q); > >> + } > > > > I think the problem here is again that because of all the other > > dependencies elevator_lock really needs to be per-set instead of > > per-queue which will allows us to have much saner locking hierarchies. > > > I believe you meant here q->tag_set->elevator_lock? I don't know what locks you are planning to invent. For set->tag_list_lock, it has been very fragile: blk_mq_update_nr_hw_queues set->tag_list_lock freeze_queue If IO failure happens when waiting in above freeze_queue(), the nvme error handling can't provide forward progress any more, because the error handling code path requires set->tag_list_lock. So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, fortunately that is what nvme is doing. > If yes then it means that we should be able to grab ->elevator_lock > before freezing the queue in __blk_mq_update_nr_hw_queues and so locking > order should be in each code path, > > __blk_mq_update_nr_hw_queues > ->elevator_lock > ->freeze_lock Now tagset->elevator_lock depends on set->tag_list_lock, and this way just make things worse. Why can't we disable elevator switch during updating nr_hw_queues? Thanks, Ming
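The hang described above can be sketched as two tasks (simplified, hypothetical call chains rather than actual kernel code):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Task A: update nr_hw_queues while I/O is still outstanding */
static void task_a_update_nr_hw_queues(struct blk_mq_tag_set *set,
				       struct request_queue *q)
{
	unsigned int memflags;

	mutex_lock(&set->tag_list_lock);
	memflags = blk_mq_freeze_queue(q);	/* waits for in-flight I/O */
	/* ... remap and reallocate hw queues ... */
	blk_mq_unfreeze_queue(q, memflags);
	mutex_unlock(&set->tag_list_lock);
}

/*
 * Task B: the nvme error handler that must quiesce the tagset and fail the
 * outstanding I/O so that Task A's freeze can complete, but
 * blk_mq_quiesce_tagset() needs the tag_list_lock Task A already holds.
 */
static void task_b_nvme_error_handler(struct blk_mq_tag_set *set)
{
	blk_mq_quiesce_tagset(set);	/* blocks on set->tag_list_lock */
	/* ... cancel outstanding requests, reset the controller ... */
}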
On Fri, Apr 04, 2025 at 08:09:31PM +0800, Ming Lei wrote: > > I think the problem here is again that because of all the other > > dependencies elevator_lock really needs to be per-set instead of > > per-queue which will allows us to have much saner locking hierarchies. > > per-set lock is required in error handling code path, anywhere it is used > with freezing together can be one deadlock risk if hardware error happens. > > Or can you explain the idea in details? I fail to parse the above sentence. Why would a per-set lock that is always taken before freezing queues have an inherent deadlock risk?
On 4/7/25 8:39 AM, Ming Lei wrote: > On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: >> >> >> On 4/4/25 2:40 PM, Christoph Hellwig wrote: >>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: >>>> Fixes the following lockdep warning: >>> >>> Please spell the actual dependency out here, links are not permanent >>> and also not readable for any offline reading of the commit logs. >>> >>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, >>>> + struct request_queue *q, bool lock) >>>> +{ >>>> + if (lock) { >>> >>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus >>> screaming at you (in this case even for the right reason :)) >>> >>>> + /* protect against switching io scheduler */ >>>> + mutex_lock(&q->elevator_lock); >>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>> + mutex_unlock(&q->elevator_lock); >>>> + } else { >>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>> + } >>> >>> I think the problem here is again that because of all the other >>> dependencies elevator_lock really needs to be per-set instead of >>> per-queue which will allows us to have much saner locking hierarchies. >>> >> I believe you meant here q->tag_set->elevator_lock? > > I don't know what locks you are planning to invent. > > For set->tag_list_lock, it has been very fragile: > > blk_mq_update_nr_hw_queues > set->tag_list_lock > freeze_queue > > If IO failure happens when waiting in above freeze_queue(), the nvme error > handling can't provide forward progress any more, because the error > handling code path requires set->tag_list_lock. I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues which is called in nvme error handling path. If yes then I believe this function could be easily modified so that it doesn't require ->tag_list_lock. > > So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, > fortunately that is what nvme is doing. > > >> If yes then it means that we should be able to grab ->elevator_lock >> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking >> order should be in each code path, >> >> __blk_mq_update_nr_hw_queues >> ->elevator_lock >> ->freeze_lock > > Now tagset->elevator_lock depends on set->tag_list_lock, and this way > just make things worse. Why can't we disable elevator switch during > updating nr_hw_queues? > I couldn't quite understand this. As we already first disable the elevator before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). Once mapping is successful we switch back the elevator. Thanks, --Nilay
On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote: > > > On 4/7/25 8:39 AM, Ming Lei wrote: > > On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: > >> > >> > >> On 4/4/25 2:40 PM, Christoph Hellwig wrote: > >>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > >>>> Fixes the following lockdep warning: > >>> > >>> Please spell the actual dependency out here, links are not permanent > >>> and also not readable for any offline reading of the commit logs. > >>> > >>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > >>>> + struct request_queue *q, bool lock) > >>>> +{ > >>>> + if (lock) { > >>> > >>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus > >>> screaming at you (in this case even for the right reason :)) > >>> > >>>> + /* protect against switching io scheduler */ > >>>> + mutex_lock(&q->elevator_lock); > >>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>> + mutex_unlock(&q->elevator_lock); > >>>> + } else { > >>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>> + } > >>> > >>> I think the problem here is again that because of all the other > >>> dependencies elevator_lock really needs to be per-set instead of > >>> per-queue which will allows us to have much saner locking hierarchies. > >>> > >> I believe you meant here q->tag_set->elevator_lock? > > > > I don't know what locks you are planning to invent. > > > > For set->tag_list_lock, it has been very fragile: > > > > blk_mq_update_nr_hw_queues > > set->tag_list_lock > > freeze_queue > > > > If IO failure happens when waiting in above freeze_queue(), the nvme error > > handling can't provide forward progress any more, because the error > > handling code path requires set->tag_list_lock. > > I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues Yes. > which is called in nvme error handling path. If yes then I believe this function > could be easily modified so that it doesn't require ->tag_list_lock. Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". And the same list is iterated in blk_mq_update_nr_hw_queues() too. > > > > > So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, > > fortunately that is what nvme is doing. > > > > > >> If yes then it means that we should be able to grab ->elevator_lock > >> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking > >> order should be in each code path, > >> > >> __blk_mq_update_nr_hw_queues > >> ->elevator_lock > >> ->freeze_lock > > > > Now tagset->elevator_lock depends on set->tag_list_lock, and this way > > just make things worse. Why can't we disable elevator switch during > > updating nr_hw_queues? > > > I couldn't quite understand this. As we already first disable the elevator > before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). > Once mapping is successful we switch back the elevator. Yes, but user still may switch elevator from none to others during the period, right? thanks, Ming
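For context, both the tagset-wide quiesce and the nr_hw_queues update walk the same set->tag_list under set->tag_list_lock; a simplified paraphrase (not a verbatim copy of the in-tree functions):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* roughly what a tagset-wide quiesce has to do */
static void sketch_quiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_quiesce_queue_nowait(q);
	mutex_unlock(&set->tag_list_lock);
	blk_mq_wait_quiesce_done(set);
}

/* the nr_hw_queues update iterates the very same list */
static void sketch_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		/* freeze q, remap sw->hw queues, reallocate hctxs ... */
	}
	mutex_unlock(&set->tag_list_lock);
}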
On 4/8/25 1:08 PM, Ming Lei wrote: > On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote: >> >> >> On 4/7/25 8:39 AM, Ming Lei wrote: >>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: >>>> >>>> >>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote: >>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: >>>>>> Fixes the following lockdep warning: >>>>> >>>>> Please spell the actual dependency out here, links are not permanent >>>>> and also not readable for any offline reading of the commit logs. >>>>> >>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, >>>>>> + struct request_queue *q, bool lock) >>>>>> +{ >>>>>> + if (lock) { >>>>> >>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus >>>>> screaming at you (in this case even for the right reason :)) >>>>> >>>>>> + /* protect against switching io scheduler */ >>>>>> + mutex_lock(&q->elevator_lock); >>>>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>>>> + mutex_unlock(&q->elevator_lock); >>>>>> + } else { >>>>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>>>> + } >>>>> >>>>> I think the problem here is again that because of all the other >>>>> dependencies elevator_lock really needs to be per-set instead of >>>>> per-queue which will allows us to have much saner locking hierarchies. >>>>> >>>> I believe you meant here q->tag_set->elevator_lock? >>> >>> I don't know what locks you are planning to invent. >>> >>> For set->tag_list_lock, it has been very fragile: >>> >>> blk_mq_update_nr_hw_queues >>> set->tag_list_lock >>> freeze_queue >>> >>> If IO failure happens when waiting in above freeze_queue(), the nvme error >>> handling can't provide forward progress any more, because the error >>> handling code path requires set->tag_list_lock. >> >> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues > > Yes. > >> which is called in nvme error handling path. If yes then I believe this function >> could be easily modified so that it doesn't require ->tag_list_lock. > > Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". > Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 777db89fdaa7..002d2fd20e0c 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) { if (!ctrl->tagset) return; - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) - blk_mq_quiesce_tagset(ctrl->tagset); - else - blk_mq_wait_quiesce_done(ctrl->tagset); + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { + struct nvme_ns *ns; + int srcu_idx; + + srcu_idx = srcu_read_lock(&ctrl->srcu); + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, + srcu_read_lock_held(&ctrl->srcu)) { + if (!blk_queue_skip_tagset_quiesce(ns->queue)) + blk_mq_quiesce_queue_nowait(ns->queue); + } + srcu_read_unlock(&ctrl->srcu, srcu_idx); + } + blk_mq_wait_quiesce_done(ctrl->tagset); } EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); Here we iterate through ctrl->namespaces instead of relying on tag_list and so we don't need to acquire ->tag_list_lock. > And the same list is iterated in blk_mq_update_nr_hw_queues() too. > >> >>> >>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, >>> fortunately that is what nvme is doing. 
>>> >>> >>>> If yes then it means that we should be able to grab ->elevator_lock >>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking >>>> order should be in each code path, >>>> >>>> __blk_mq_update_nr_hw_queues >>>> ->elevator_lock >>>> ->freeze_lock >>> >>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way >>> just make things worse. Why can't we disable elevator switch during >>> updating nr_hw_queues? >>> >> I couldn't quite understand this. As we already first disable the elevator >> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). >> Once mapping is successful we switch back the elevator. > > Yes, but user still may switch elevator from none to others during the > period, right? > Yes correct, that's possible. So your suggestion was to disable elevator update while we're running __blk_mq_update_nr_hw_queues? And that way user couldn't update elevator through sysfs (elv_iosched_store) while we update nr_hw_queues? If this is true then still how could it help solve lockdep splat? Thanks, --Nilay
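Since the inline diff above is hard to read in this archive, here is the resulting nvme_quiesce_io_queues() from Nilay's proposal transcribed as plain code (a readability aid only; the quoted diff is the authoritative version):

void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
{
	if (!ctrl->tagset)
		return;
	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
		struct nvme_ns *ns;
		int srcu_idx;

		srcu_idx = srcu_read_lock(&ctrl->srcu);
		list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
				srcu_read_lock_held(&ctrl->srcu)) {
			if (!blk_queue_skip_tagset_quiesce(ns->queue))
				blk_mq_quiesce_queue_nowait(ns->queue);
		}
		srcu_read_unlock(&ctrl->srcu, srcu_idx);
	}
	blk_mq_wait_quiesce_done(ctrl->tagset);
}
EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);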
On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote: > > > On 4/8/25 1:08 PM, Ming Lei wrote: > > On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote: > >> > >> > >> On 4/7/25 8:39 AM, Ming Lei wrote: > >>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: > >>>> > >>>> > >>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote: > >>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > >>>>>> Fixes the following lockdep warning: > >>>>> > >>>>> Please spell the actual dependency out here, links are not permanent > >>>>> and also not readable for any offline reading of the commit logs. > >>>>> > >>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > >>>>>> + struct request_queue *q, bool lock) > >>>>>> +{ > >>>>>> + if (lock) { > >>>>> > >>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus > >>>>> screaming at you (in this case even for the right reason :)) > >>>>> > >>>>>> + /* protect against switching io scheduler */ > >>>>>> + mutex_lock(&q->elevator_lock); > >>>>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>>>> + mutex_unlock(&q->elevator_lock); > >>>>>> + } else { > >>>>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>>>> + } > >>>>> > >>>>> I think the problem here is again that because of all the other > >>>>> dependencies elevator_lock really needs to be per-set instead of > >>>>> per-queue which will allows us to have much saner locking hierarchies. > >>>>> > >>>> I believe you meant here q->tag_set->elevator_lock? > >>> > >>> I don't know what locks you are planning to invent. > >>> > >>> For set->tag_list_lock, it has been very fragile: > >>> > >>> blk_mq_update_nr_hw_queues > >>> set->tag_list_lock > >>> freeze_queue > >>> > >>> If IO failure happens when waiting in above freeze_queue(), the nvme error > >>> handling can't provide forward progress any more, because the error > >>> handling code path requires set->tag_list_lock. > >> > >> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues > > > > Yes. > > > >> which is called in nvme error handling path. If yes then I believe this function > >> could be easily modified so that it doesn't require ->tag_list_lock. > > > > Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". > > > Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 777db89fdaa7..002d2fd20e0c 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) > { > if (!ctrl->tagset) > return; > - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) > - blk_mq_quiesce_tagset(ctrl->tagset); > - else > - blk_mq_wait_quiesce_done(ctrl->tagset); > + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { > + struct nvme_ns *ns; > + int srcu_idx; > + > + srcu_idx = srcu_read_lock(&ctrl->srcu); > + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, > + srcu_read_lock_held(&ctrl->srcu)) { > + if (!blk_queue_skip_tagset_quiesce(ns->queue)) > + blk_mq_quiesce_queue_nowait(ns->queue); > + } > + srcu_read_unlock(&ctrl->srcu, srcu_idx); > + } > + blk_mq_wait_quiesce_done(ctrl->tagset); > } > EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); > > Here we iterate through ctrl->namespaces instead of relying on tag_list > and so we don't need to acquire ->tag_list_lock. How can you make sure all NSs are covered in this way? 
RCU/SRCU can't provide such kind of guarantee. > > > And the same list is iterated in blk_mq_update_nr_hw_queues() too. > > > >> > >>> > >>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, > >>> fortunately that is what nvme is doing. > >>> > >>> > >>>> If yes then it means that we should be able to grab ->elevator_lock > >>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking > >>>> order should be in each code path, > >>>> > >>>> __blk_mq_update_nr_hw_queues > >>>> ->elevator_lock > >>>> ->freeze_lock > >>> > >>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way > >>> just make things worse. Why can't we disable elevator switch during > >>> updating nr_hw_queues? > >>> > >> I couldn't quite understand this. As we already first disable the elevator > >> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). > >> Once mapping is successful we switch back the elevator. > > > > Yes, but user still may switch elevator from none to others during the > > period, right? > > > Yes correct, that's possible. So your suggestion was to disable elevator > update while we're running __blk_mq_update_nr_hw_queues? And that way user > couldn't update elevator through sysfs (elv_iosched_store) while we update > nr_hw_queues? If this is true then still how could it help solve lockdep > splat? Then why do you think per-set lock can solve the lockdep splat? __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues involved wrt. switching elevator. If elevator switching is not allowed when __blk_mq_update_nr_hw_queues() is started, why do we need per-set lock? Thanks, Ming
On 4/8/25 7:20 PM, Ming Lei wrote: > On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote: >> >> >> On 4/8/25 1:08 PM, Ming Lei wrote: >>> On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote: >>>> >>>> >>>> On 4/7/25 8:39 AM, Ming Lei wrote: >>>>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: >>>>>> >>>>>> >>>>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote: >>>>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: >>>>>>>> Fixes the following lockdep warning: >>>>>>> >>>>>>> Please spell the actual dependency out here, links are not permanent >>>>>>> and also not readable for any offline reading of the commit logs. >>>>>>> >>>>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, >>>>>>>> + struct request_queue *q, bool lock) >>>>>>>> +{ >>>>>>>> + if (lock) { >>>>>>> >>>>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus >>>>>>> screaming at you (in this case even for the right reason :)) >>>>>>> >>>>>>>> + /* protect against switching io scheduler */ >>>>>>>> + mutex_lock(&q->elevator_lock); >>>>>>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>>>>>> + mutex_unlock(&q->elevator_lock); >>>>>>>> + } else { >>>>>>>> + __blk_mq_realloc_hw_ctxs(set, q); >>>>>>>> + } >>>>>>> >>>>>>> I think the problem here is again that because of all the other >>>>>>> dependencies elevator_lock really needs to be per-set instead of >>>>>>> per-queue which will allows us to have much saner locking hierarchies. >>>>>>> >>>>>> I believe you meant here q->tag_set->elevator_lock? >>>>> >>>>> I don't know what locks you are planning to invent. >>>>> >>>>> For set->tag_list_lock, it has been very fragile: >>>>> >>>>> blk_mq_update_nr_hw_queues >>>>> set->tag_list_lock >>>>> freeze_queue >>>>> >>>>> If IO failure happens when waiting in above freeze_queue(), the nvme error >>>>> handling can't provide forward progress any more, because the error >>>>> handling code path requires set->tag_list_lock. >>>> >>>> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues >>> >>> Yes. >>> >>>> which is called in nvme error handling path. If yes then I believe this function >>>> could be easily modified so that it doesn't require ->tag_list_lock. >>> >>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". >>> >> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 777db89fdaa7..002d2fd20e0c 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) >> { >> if (!ctrl->tagset) >> return; >> - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) >> - blk_mq_quiesce_tagset(ctrl->tagset); >> - else >> - blk_mq_wait_quiesce_done(ctrl->tagset); >> + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { >> + struct nvme_ns *ns; >> + int srcu_idx; >> + >> + srcu_idx = srcu_read_lock(&ctrl->srcu); >> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, >> + srcu_read_lock_held(&ctrl->srcu)) { >> + if (!blk_queue_skip_tagset_quiesce(ns->queue)) >> + blk_mq_quiesce_queue_nowait(ns->queue); >> + } >> + srcu_read_unlock(&ctrl->srcu, srcu_idx); >> + } >> + blk_mq_wait_quiesce_done(ctrl->tagset); >> } >> EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); >> >> Here we iterate through ctrl->namespaces instead of relying on tag_list >> and so we don't need to acquire ->tag_list_lock. 
> > How can you make sure all NSs are covered in this way? RCU/SRCU can't > provide such kind of guarantee. > Why is that so? In fact, nvme_wait_freeze also iterates through the same ctrl->namespaces to freeze the queue. >> >>> And the same list is iterated in blk_mq_update_nr_hw_queues() too. >>> >>>> >>>>> >>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, >>>>> fortunately that is what nvme is doing. >>>>> >>>>> >>>>>> If yes then it means that we should be able to grab ->elevator_lock >>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking >>>>>> order should be in each code path, >>>>>> >>>>>> __blk_mq_update_nr_hw_queues >>>>>> ->elevator_lock >>>>>> ->freeze_lock >>>>> >>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way >>>>> just make things worse. Why can't we disable elevator switch during >>>>> updating nr_hw_queues? >>>>> >>>> I couldn't quite understand this. As we already first disable the elevator >>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). >>>> Once mapping is successful we switch back the elevator. >>> >>> Yes, but user still may switch elevator from none to others during the >>> period, right? >>> >> Yes correct, that's possible. So your suggestion was to disable elevator >> update while we're running __blk_mq_update_nr_hw_queues? And that way user >> couldn't update elevator through sysfs (elv_iosched_store) while we update >> nr_hw_queues? If this is true then still how could it help solve lockdep >> splat? > > Then why do you think per-set lock can solve the lockdep splat? > > __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues > involved wrt. switching elevator. If elevator switching is not allowed > when __blk_mq_update_nr_hw_queues() is started, why do we need per-set > lock? > Yes if elevator switch is not allowed then we probably don't need per-set lock. However my question was if we were to not allow elevator switch while __blk_mq_update_nr_hw_queues is running then how would we implement it? Do we need to synchronize with ->tag_list_lock? Or in another words, elv_iosched_store would now depends on ->tag_list_lock ? On another note, if we choose to make ->elevator_lock per-set then our locking sequence in blk_mq_update_nr_hw_queues() would be, blk_mq_update_nr_hw_queues -> tag_list_lock -> elevator_lock -> freeze_lock elv_iosched_store -> elevator_lock -> freeze_lock So now ->freeze_lock should not depend on ->elevator_lock and that shall help avoid few of the recent lockdep splats reported with fs_reclaim. What do you think? Thanks, --Nilay
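A minimal sketch of the ordering proposed here, assuming a hypothetical per-set lock (the set->elevator_lock field and example_elv_store() below are illustrative and not part of any posted patch); the point is that the elevator lock is always taken before the queue is frozen, so ->freeze_lock never nests inside fs_reclaim under ->elevator_lock:

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* block-layer internal, as used in the quoted patches */
int elevator_change(struct request_queue *q, const char *name);

static ssize_t example_elv_store(struct gendisk *disk, const char *name,
				 size_t count)
{
	struct request_queue *q = disk->queue;
	struct blk_mq_tag_set *set = q->tag_set;
	unsigned int memflags;
	int ret;

	mutex_lock(&set->elevator_lock);	/* hypothetical per-set lock */
	memflags = blk_mq_freeze_queue(q);	/* freeze only after the lock */
	ret = elevator_change(q, name);
	blk_mq_unfreeze_queue(q, memflags);
	mutex_unlock(&set->elevator_lock);

	return ret ? ret : count;
}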
On Wed, Apr 09, 2025 at 02:42:06PM +0530, Nilay Shroff wrote: > > > On 4/8/25 7:20 PM, Ming Lei wrote: > > On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote: > >> > >> > >> On 4/8/25 1:08 PM, Ming Lei wrote: > >>> On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote: > >>>> > >>>> > >>>> On 4/7/25 8:39 AM, Ming Lei wrote: > >>>>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote: > >>>>>> > >>>>>> > >>>>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote: > >>>>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote: > >>>>>>>> Fixes the following lockdep warning: > >>>>>>> > >>>>>>> Please spell the actual dependency out here, links are not permanent > >>>>>>> and also not readable for any offline reading of the commit logs. > >>>>>>> > >>>>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, > >>>>>>>> + struct request_queue *q, bool lock) > >>>>>>>> +{ > >>>>>>>> + if (lock) { > >>>>>>> > >>>>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus > >>>>>>> screaming at you (in this case even for the right reason :)) > >>>>>>> > >>>>>>>> + /* protect against switching io scheduler */ > >>>>>>>> + mutex_lock(&q->elevator_lock); > >>>>>>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>>>>>> + mutex_unlock(&q->elevator_lock); > >>>>>>>> + } else { > >>>>>>>> + __blk_mq_realloc_hw_ctxs(set, q); > >>>>>>>> + } > >>>>>>> > >>>>>>> I think the problem here is again that because of all the other > >>>>>>> dependencies elevator_lock really needs to be per-set instead of > >>>>>>> per-queue which will allows us to have much saner locking hierarchies. > >>>>>>> > >>>>>> I believe you meant here q->tag_set->elevator_lock? > >>>>> > >>>>> I don't know what locks you are planning to invent. > >>>>> > >>>>> For set->tag_list_lock, it has been very fragile: > >>>>> > >>>>> blk_mq_update_nr_hw_queues > >>>>> set->tag_list_lock > >>>>> freeze_queue > >>>>> > >>>>> If IO failure happens when waiting in above freeze_queue(), the nvme error > >>>>> handling can't provide forward progress any more, because the error > >>>>> handling code path requires set->tag_list_lock. > >>>> > >>>> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues > >>> > >>> Yes. > >>> > >>>> which is called in nvme error handling path. If yes then I believe this function > >>>> could be easily modified so that it doesn't require ->tag_list_lock. > >>> > >>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". 
> >>> > >> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: > >> > >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >> index 777db89fdaa7..002d2fd20e0c 100644 > >> --- a/drivers/nvme/host/core.c > >> +++ b/drivers/nvme/host/core.c > >> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) > >> { > >> if (!ctrl->tagset) > >> return; > >> - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) > >> - blk_mq_quiesce_tagset(ctrl->tagset); > >> - else > >> - blk_mq_wait_quiesce_done(ctrl->tagset); > >> + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { > >> + struct nvme_ns *ns; > >> + int srcu_idx; > >> + > >> + srcu_idx = srcu_read_lock(&ctrl->srcu); > >> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, > >> + srcu_read_lock_held(&ctrl->srcu)) { > >> + if (!blk_queue_skip_tagset_quiesce(ns->queue)) > >> + blk_mq_quiesce_queue_nowait(ns->queue); > >> + } > >> + srcu_read_unlock(&ctrl->srcu, srcu_idx); > >> + } > >> + blk_mq_wait_quiesce_done(ctrl->tagset); > >> } > >> EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); > >> > >> Here we iterate through ctrl->namespaces instead of relying on tag_list > >> and so we don't need to acquire ->tag_list_lock. > > > > How can you make sure all NSs are covered in this way? RCU/SRCU can't > > provide such kind of guarantee. > > > Why is that so? In fact, nvme_wait_freeze also iterates through > the same ctrl->namespaces to freeze the queue. It depends if nvme error handling needs to cover new coming NS, suppose it doesn't care, and you can change to srcu and bypass ->tag_list_lock. > > >> > >>> And the same list is iterated in blk_mq_update_nr_hw_queues() too. > >>> > >>>> > >>>>> > >>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, > >>>>> fortunately that is what nvme is doing. > >>>>> > >>>>> > >>>>>> If yes then it means that we should be able to grab ->elevator_lock > >>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking > >>>>>> order should be in each code path, > >>>>>> > >>>>>> __blk_mq_update_nr_hw_queues > >>>>>> ->elevator_lock > >>>>>> ->freeze_lock > >>>>> > >>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way > >>>>> just make things worse. Why can't we disable elevator switch during > >>>>> updating nr_hw_queues? > >>>>> > >>>> I couldn't quite understand this. As we already first disable the elevator > >>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). > >>>> Once mapping is successful we switch back the elevator. > >>> > >>> Yes, but user still may switch elevator from none to others during the > >>> period, right? > >>> > >> Yes correct, that's possible. So your suggestion was to disable elevator > >> update while we're running __blk_mq_update_nr_hw_queues? And that way user > >> couldn't update elevator through sysfs (elv_iosched_store) while we update > >> nr_hw_queues? If this is true then still how could it help solve lockdep > >> splat? > > > > Then why do you think per-set lock can solve the lockdep splat? > > > > __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues > > involved wrt. switching elevator. If elevator switching is not allowed > > when __blk_mq_update_nr_hw_queues() is started, why do we need per-set > > lock? > > > Yes if elevator switch is not allowed then we probably don't need per-set lock. 
> However my question was if we were to not allow elevator switch while > __blk_mq_update_nr_hw_queues is running then how would we implement it? It can be done easily by tag_set->srcu. > Do we need to synchronize with ->tag_list_lock? Or in another words, > elv_iosched_store would now depends on ->tag_list_lock ? ->tag_list_lock isn't involved. > > On another note, if we choose to make ->elevator_lock per-set then > our locking sequence in blk_mq_update_nr_hw_queues() would be, There is also add/del disk vs. updating nr_hw_queues, do you want to add the per-set lock in add/del disk path too? > > blk_mq_update_nr_hw_queues > -> tag_list_lock > -> elevator_lock > -> freeze_lock Actually freeze lock is already held for nvme before calling blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue frozen for updating nr_hw_queues, so the above order may not match with the existed code. Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > > elv_iosched_store > -> elevator_lock > -> freeze_lock I understand that the per-set elevator_lock is just for avoiding the nested elvevator lock class acquire? If we needn't to consider nvme or blk_mq_update_nr_hw_queues(), this per-set lock may not be needed. It is actually easy to sync elevator store vs. update nr_hw_queues. > > So now ->freeze_lock should not depend on ->elevator_lock and that shall > help avoid few of the recent lockdep splats reported with fs_reclaim. > What do you think? Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim related splat. However, in del_gendisk(), freeze_lock is still held before calling elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. Thanks, Ming
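The tag_set->srcu idea mentioned above follows a standard SRCU pattern; a generic sketch with made-up types (struct example_set is illustrative, not the blk-mq tag set — Ming's concrete patch appears later in this thread):

#include <linux/errno.h>
#include <linux/srcu.h>

struct example_set {
	struct srcu_struct srcu;	/* initialized with init_srcu_struct() */
	bool updating;
};

/* updater side: publish the flag, then wait for current readers to drain */
static void example_begin_update(struct example_set *set)
{
	WRITE_ONCE(set->updating, true);
	synchronize_srcu(&set->srcu);
}

static void example_end_update(struct example_set *set)
{
	WRITE_ONCE(set->updating, false);
}

/* reader side: the elevator-switch path backs off while an update runs */
static int example_try_switch(struct example_set *set)
{
	int idx, ret = 0;

	idx = srcu_read_lock(&set->srcu);
	if (READ_ONCE(set->updating))
		ret = -EBUSY;
	/* else: perform the actual elevator switch here */
	srcu_read_unlock(&set->srcu, idx);
	return ret;
}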
On 4/9/25 5:16 PM, Ming Lei wrote: >>>>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". >>>>> >>>> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: >>>> >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>>> index 777db89fdaa7..002d2fd20e0c 100644 >>>> --- a/drivers/nvme/host/core.c >>>> +++ b/drivers/nvme/host/core.c >>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) >>>> { >>>> if (!ctrl->tagset) >>>> return; >>>> - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) >>>> - blk_mq_quiesce_tagset(ctrl->tagset); >>>> - else >>>> - blk_mq_wait_quiesce_done(ctrl->tagset); >>>> + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { >>>> + struct nvme_ns *ns; >>>> + int srcu_idx; >>>> + >>>> + srcu_idx = srcu_read_lock(&ctrl->srcu); >>>> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, >>>> + srcu_read_lock_held(&ctrl->srcu)) { >>>> + if (!blk_queue_skip_tagset_quiesce(ns->queue)) >>>> + blk_mq_quiesce_queue_nowait(ns->queue); >>>> + } >>>> + srcu_read_unlock(&ctrl->srcu, srcu_idx); >>>> + } >>>> + blk_mq_wait_quiesce_done(ctrl->tagset); >>>> } >>>> EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); >>>> >>>> Here we iterate through ctrl->namespaces instead of relying on tag_list >>>> and so we don't need to acquire ->tag_list_lock. >>> >>> How can you make sure all NSs are covered in this way? RCU/SRCU can't >>> provide such kind of guarantee. >>> >> Why is that so? In fact, nvme_wait_freeze also iterates through >> the same ctrl->namespaces to freeze the queue. > > It depends if nvme error handling needs to cover new coming NS, > suppose it doesn't care, and you can change to srcu and bypass > ->tag_list_lock. > Yes new incoming NS may not be live yet when we iterate through ctrl->namespaces. So we don't need bother about it yet. >> >>>> >>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too. >>>>> >>>>>> >>>>>>> >>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, >>>>>>> fortunately that is what nvme is doing. >>>>>>> >>>>>>> >>>>>>>> If yes then it means that we should be able to grab ->elevator_lock >>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking >>>>>>>> order should be in each code path, >>>>>>>> >>>>>>>> __blk_mq_update_nr_hw_queues >>>>>>>> ->elevator_lock >>>>>>>> ->freeze_lock >>>>>>> >>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way >>>>>>> just make things worse. Why can't we disable elevator switch during >>>>>>> updating nr_hw_queues? >>>>>>> >>>>>> I couldn't quite understand this. As we already first disable the elevator >>>>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). >>>>>> Once mapping is successful we switch back the elevator. >>>>> >>>>> Yes, but user still may switch elevator from none to others during the >>>>> period, right? >>>>> >>>> Yes correct, that's possible. So your suggestion was to disable elevator >>>> update while we're running __blk_mq_update_nr_hw_queues? And that way user >>>> couldn't update elevator through sysfs (elv_iosched_store) while we update >>>> nr_hw_queues? If this is true then still how could it help solve lockdep >>>> splat? >>> >>> Then why do you think per-set lock can solve the lockdep splat? >>> >>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues >>> involved wrt. switching elevator. 
If elevator switching is not allowed >>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set >>> lock? >>> >> Yes if elevator switch is not allowed then we probably don't need per-set lock. >> However my question was if we were to not allow elevator switch while >> __blk_mq_update_nr_hw_queues is running then how would we implement it? > > It can be done easily by tag_set->srcu. Ok great if that's possible! But I'm not sure how it could be done in this case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store run in the writer/updater context. So you may still need lock? Can you please send across a (informal) patch with your idea ? > >> Do we need to synchronize with ->tag_list_lock? Or in another words, >> elv_iosched_store would now depends on ->tag_list_lock ? > > ->tag_list_lock isn't involved. > >> >> On another note, if we choose to make ->elevator_lock per-set then >> our locking sequence in blk_mq_update_nr_hw_queues() would be, > > There is also add/del disk vs. updating nr_hw_queues, do you want to > add the per-set lock in add/del disk path too? Ideally no we don't need to acquire ->elevator_lock in this path. Please see below. >> >> blk_mq_update_nr_hw_queues >> -> tag_list_lock >> -> elevator_lock >> -> freeze_lock > > Actually freeze lock is already held for nvme before calling > blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue > frozen for updating nr_hw_queues, so the above order may not match > with the existed code. > > Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > I think we should consider (may be in different patch) updating nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove its dependency on ->tag_list_lock. >> >> elv_iosched_store >> -> elevator_lock >> -> freeze_lock > > I understand that the per-set elevator_lock is just for avoiding the > nested elvevator lock class acquire? If we needn't to consider nvme > or blk_mq_update_nr_hw_queues(), this per-set lock may not be needed. > > It is actually easy to sync elevator store vs. update nr_hw_queues. > >> >> So now ->freeze_lock should not depend on ->elevator_lock and that shall >> help avoid few of the recent lockdep splats reported with fs_reclaim. >> What do you think? > > Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim > related splat. > > However, in del_gendisk(), freeze_lock is still held before calling > elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. Yes agreed, however elevator_exit() called from del_gendisk() or elv_unregister_queue() called from blk_unregister_queue() are called after we unregister the queue. And if queue has been already unregistered while invoking elevator_exit or del_gensidk then ideally we don't need to acquire ->elevator_lock. The same is true for elevator_exit() called from add_disk_fwnode(). So IMO, we should update these paths to avoid acquiring ->elevator_lock. Thanks, --Nilay
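A rough sketch of the suggestion above (hypothetical helper, not a posted patch), assuming the caller knows whether the queue is still registered and that elevator teardown only needs ->elevator_lock while a sysfs-driven switch can still race:

#include <linux/blkdev.h>

void elevator_exit(struct request_queue *q);	/* block-layer internal */

static void example_elevator_teardown(struct request_queue *q,
				      bool still_registered)
{
	if (still_registered) {
		/* a sysfs elevator switch may still race with us */
		mutex_lock(&q->elevator_lock);
		elevator_exit(q);
		mutex_unlock(&q->elevator_lock);
	} else {
		/* queue already unregistered: no switch can race */
		elevator_exit(q);
	}
}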
On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote: > > > On 4/9/25 5:16 PM, Ming Lei wrote: > >>>>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list". > >>>>> > >>>> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock: > >>>> > >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >>>> index 777db89fdaa7..002d2fd20e0c 100644 > >>>> --- a/drivers/nvme/host/core.c > >>>> +++ b/drivers/nvme/host/core.c > >>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl) > >>>> { > >>>> if (!ctrl->tagset) > >>>> return; > >>>> - if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) > >>>> - blk_mq_quiesce_tagset(ctrl->tagset); > >>>> - else > >>>> - blk_mq_wait_quiesce_done(ctrl->tagset); > >>>> + if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) { > >>>> + struct nvme_ns *ns; > >>>> + int srcu_idx; > >>>> + > >>>> + srcu_idx = srcu_read_lock(&ctrl->srcu); > >>>> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list, > >>>> + srcu_read_lock_held(&ctrl->srcu)) { > >>>> + if (!blk_queue_skip_tagset_quiesce(ns->queue)) > >>>> + blk_mq_quiesce_queue_nowait(ns->queue); > >>>> + } > >>>> + srcu_read_unlock(&ctrl->srcu, srcu_idx); > >>>> + } > >>>> + blk_mq_wait_quiesce_done(ctrl->tagset); > >>>> } > >>>> EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues); > >>>> > >>>> Here we iterate through ctrl->namespaces instead of relying on tag_list > >>>> and so we don't need to acquire ->tag_list_lock. > >>> > >>> How can you make sure all NSs are covered in this way? RCU/SRCU can't > >>> provide such kind of guarantee. > >>> > >> Why is that so? In fact, nvme_wait_freeze also iterates through > >> the same ctrl->namespaces to freeze the queue. > > > > It depends if nvme error handling needs to cover new coming NS, > > suppose it doesn't care, and you can change to srcu and bypass > > ->tag_list_lock. > > > Yes new incoming NS may not be live yet when we iterate through > ctrl->namespaces. So we don't need bother about it yet. > >> > >>>> > >>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too. > >>>>> > >>>>>> > >>>>>>> > >>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues, > >>>>>>> fortunately that is what nvme is doing. > >>>>>>> > >>>>>>> > >>>>>>>> If yes then it means that we should be able to grab ->elevator_lock > >>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking > >>>>>>>> order should be in each code path, > >>>>>>>> > >>>>>>>> __blk_mq_update_nr_hw_queues > >>>>>>>> ->elevator_lock > >>>>>>>> ->freeze_lock > >>>>>>> > >>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way > >>>>>>> just make things worse. Why can't we disable elevator switch during > >>>>>>> updating nr_hw_queues? > >>>>>>> > >>>>>> I couldn't quite understand this. As we already first disable the elevator > >>>>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues(). > >>>>>> Once mapping is successful we switch back the elevator. > >>>>> > >>>>> Yes, but user still may switch elevator from none to others during the > >>>>> period, right? > >>>>> > >>>> Yes correct, that's possible. So your suggestion was to disable elevator > >>>> update while we're running __blk_mq_update_nr_hw_queues? And that way user > >>>> couldn't update elevator through sysfs (elv_iosched_store) while we update > >>>> nr_hw_queues? If this is true then still how could it help solve lockdep > >>>> splat? 
> >>> > >>> Then why do you think per-set lock can solve the lockdep splat? > >>> > >>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues > >>> involved wrt. switching elevator. If elevator switching is not allowed > >>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set > >>> lock? > >>> > >> Yes if elevator switch is not allowed then we probably don't need per-set lock. > >> However my question was if we were to not allow elevator switch while > >> __blk_mq_update_nr_hw_queues is running then how would we implement it? > > > > It can be done easily by tag_set->srcu. > Ok great if that's possible! But I'm not sure how it could be done in this > case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store > run in the writer/updater context. So you may still need lock? Can you > please send across a (informal) patch with your idea ? Please see the attached patch which treats elv_iosched_store() as reader. > > > > >> Do we need to synchronize with ->tag_list_lock? Or in another words, > >> elv_iosched_store would now depends on ->tag_list_lock ? > > > > ->tag_list_lock isn't involved. > > > >> > >> On another note, if we choose to make ->elevator_lock per-set then > >> our locking sequence in blk_mq_update_nr_hw_queues() would be, > > > > There is also add/del disk vs. updating nr_hw_queues, do you want to > > add the per-set lock in add/del disk path too? > > Ideally no we don't need to acquire ->elevator_lock in this path. > Please see below. blk_mq_update_nr_hw_queues() can come anytime, and there is still race between blk_mq_update_nr_hw_queues and add/del disk. > > >> > >> blk_mq_update_nr_hw_queues > >> -> tag_list_lock > >> -> elevator_lock > >> -> freeze_lock > > > > Actually freeze lock is already held for nvme before calling > > blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue > > frozen for updating nr_hw_queues, so the above order may not match > > with the existed code. > > > > Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > > > I think we should consider (may be in different patch) updating > nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove > its dependency on ->tag_list_lock. If we need to take nvme into account, the above lock order doesn't work, because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), and elevator lock still depends on freeze lock. If it needn't to be considered, per-set lock becomes necessary. > > >> > >> elv_iosched_store > >> -> elevator_lock > >> -> freeze_lock > > > > I understand that the per-set elevator_lock is just for avoiding the > > nested elvevator lock class acquire? If we needn't to consider nvme > > or blk_mq_update_nr_hw_queues(), this per-set lock may not be needed. > > > > It is actually easy to sync elevator store vs. update nr_hw_queues. > > > >> > >> So now ->freeze_lock should not depend on ->elevator_lock and that shall > >> help avoid few of the recent lockdep splats reported with fs_reclaim. > >> What do you think? > > > > Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim > > related splat. > > > > However, in del_gendisk(), freeze_lock is still held before calling > > elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. > > Yes agreed, however elevator_exit() called from del_gendisk() or > elv_unregister_queue() called from blk_unregister_queue() are called > after we unregister the queue. 
And if queue has been already unregistered > while invoking elevator_exit or del_gensidk then ideally we don't need to > acquire ->elevator_lock. The same is true for elevator_exit() called > from add_disk_fwnode(). So IMO, we should update these paths to avoid > acquiring ->elevator_lock. As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one host wide event, so either lock or srcu sync is needed. Thanks, Ming From a475139e47e745f32a68725f4abc59f9a0083d57 Mon Sep 17 00:00:00 2001 From: Ming Lei <ming.lei@redhat.com> Date: Tue, 1 Apr 2025 12:42:29 +0000 Subject: [PATCH] block: prevent elevator switch during updating nr_hw_queues updating nr_hw_queues is usually used for error handling code, when it doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues may change, and elevator tags depends on nr_hw_queues. Prevent elevator switch during updating nr_hw_queues by setting flag of BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during the period. Signed-off-by: Ming Lei <ming.lei@redhat.com> --- block/blk-mq-debugfs.c | 1 + block/blk-mq.c | 32 ++++++++++++++++++++------------ block/elevator.c | 12 +++++++++++- include/linux/blk-mq.h | 9 ++++++++- 4 files changed, 40 insertions(+), 14 deletions(-) diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index c308699ded58..27f984311bb7 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -180,6 +180,7 @@ static const char *const hctx_flag_name[] = { HCTX_FLAG_NAME(BLOCKING), HCTX_FLAG_NAME(TAG_RR), HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT), + HCTX_FLAG_NAME(UPDATE_HW_QUEUES), }; #undef HCTX_FLAG_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index d7a103dc258b..c1e7e1823369 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4776,14 +4776,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set) if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids) set->nr_hw_queues = nr_cpu_ids; - if (set->flags & BLK_MQ_F_BLOCKING) { - set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL); - if (!set->srcu) - return -ENOMEM; - ret = init_srcu_struct(set->srcu); - if (ret) - goto out_free_srcu; - } + set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL); + if (!set->srcu) + return -ENOMEM; + ret = init_srcu_struct(set->srcu); + if (ret) + goto out_free_srcu; ret = -ENOMEM; set->tags = kcalloc_node(set->nr_hw_queues, @@ -4864,10 +4862,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set) kfree(set->tags); set->tags = NULL; - if (set->flags & BLK_MQ_F_BLOCKING) { - cleanup_srcu_struct(set->srcu); - kfree(set->srcu); - } + + cleanup_srcu_struct(set->srcu); + kfree(set->srcu); } EXPORT_SYMBOL(blk_mq_free_tag_set); @@ -5081,7 +5078,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) { mutex_lock(&set->tag_list_lock); + /* + * Mark us in updating nr_hw_queues for preventing switching + * elevator + * + * Elevator switch code can _not_ acquire ->tag_list_lock + */ + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES; + synchronize_srcu(set->srcu); + __blk_mq_update_nr_hw_queues(set, nr_hw_queues); + + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES; mutex_unlock(&set->tag_list_lock); } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); diff --git a/block/elevator.c b/block/elevator.c index cf48613c6e62..e50e04ed15a0 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -718,9 +718,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, { char elevator_name[ELV_NAME_MAX]; char *name; - int ret; + int 
ret, idx; unsigned int memflags; struct request_queue *q = disk->queue; + struct blk_mq_tag_set *set = q->tag_set; /* * If the attribute needs to load a module, do it before freezing the @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, elv_iosched_load_module(name); + idx = srcu_read_lock(set->srcu); + + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) { + ret = -EBUSY; + goto exit; + } + memflags = blk_mq_freeze_queue(q); mutex_lock(&q->elevator_lock); ret = elevator_change(q, name); @@ -739,6 +747,8 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, ret = count; mutex_unlock(&q->elevator_lock); blk_mq_unfreeze_queue(q, memflags); +exit: + srcu_read_unlock(set->srcu, idx); return ret; } diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 8eb9b3310167..71e05245af9d 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -681,7 +681,14 @@ enum { */ BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6, - BLK_MQ_F_MAX = 1 << 7, + /* + * True when updating nr_hw_queues is in-progress + * + * tag_set only flag, not usable for hctx + */ + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7, + + BLK_MQ_F_MAX = 1 << 8, }; #define BLK_MQ_MAX_DEPTH (10240)
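The core of the attached patch is a standard SRCU flag pattern: the updater publishes a flag and then waits in synchronize_srcu(), so a concurrent reader either entered its read-side section before the flag was set (and is waited for) or sees the flag and backs off. A minimal sketch of that pattern, with illustrative names rather than the real block-layer symbols, might look like this:

#include <linux/srcu.h>
#include <linux/errno.h>

/* Illustrative sketch only; not the block-layer code from the patch. */
DEFINE_STATIC_SRCU(update_srcu);
static bool update_in_progress;

static void begin_update(void)          /* cf. blk_mq_update_nr_hw_queues() */
{
        WRITE_ONCE(update_in_progress, true);
        /* wait out readers that entered before the flag was visible */
        synchronize_srcu(&update_srcu);
}

static void end_update(void)
{
        WRITE_ONCE(update_in_progress, false);
}

static int try_switch(void)             /* cf. elv_iosched_store() */
{
        int idx, ret = 0;

        idx = srcu_read_lock(&update_srcu);
        if (READ_ONCE(update_in_progress)) {
                ret = -EBUSY;           /* fail the switch while the update runs */
                goto unlock;
        }
        /* safe to proceed: an updater that set the flag after this point
         * still has to wait for this read-side section in synchronize_srcu() */
unlock:
        srcu_read_unlock(&update_srcu, idx);
        return ret;
}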
On 4/9/25 7:38 PM, Ming Lei wrote: >>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues >>>>> involved wrt. switching elevator. If elevator switching is not allowed >>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set >>>>> lock? >>>>> >>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. >>>> However my question was if we were to not allow elevator switch while >>>> __blk_mq_update_nr_hw_queues is running then how would we implement it? >>> >>> It can be done easily by tag_set->srcu. >> Ok great if that's possible! But I'm not sure how it could be done in this >> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store >> run in the writer/updater context. So you may still need lock? Can you >> please send across a (informal) patch with your idea ? > > Please see the attached patch which treats elv_iosched_store() as reader. > I looked trough patch and looks good. However we still have an issue in code paths where we acquire ->elevator_lock without first freezing the queue.In this case if we try allocate memory using GFP_KERNEL then fs_reclaim would still trigger lockdep splat. As for this case the lock order would be, elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io) In fact, I tested your patch and got into the above splat. I've attached lockdep splat for your reference. So IMO, we should instead try make locking order such that ->freeze_lock shall never depend on ->elevator_lock. It seems one way to make that possible if we make ->elevator_lock per-set. >>> >>> Actually freeze lock is already held for nvme before calling >>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue >>> frozen for updating nr_hw_queues, so the above order may not match >>> with the existed code. >>> >>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now? >>> >> I think we should consider (may be in different patch) updating >> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove >> its dependency on ->tag_list_lock. > > If we need to take nvme into account, the above lock order doesn't work, > because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), > and elevator lock still depends on freeze lock. > > If it needn't to be considered, per-set lock becomes necessary. IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues we shall also update nvme pci, rdma and tcp drivers so that those drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues. >>>> >>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall >>>> help avoid few of the recent lockdep splats reported with fs_reclaim. >>>> What do you think? >>> >>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim >>> related splat. >>> >>> However, in del_gendisk(), freeze_lock is still held before calling >>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. >> >> Yes agreed, however elevator_exit() called from del_gendisk() or >> elv_unregister_queue() called from blk_unregister_queue() are called >> after we unregister the queue. And if queue has been already unregistered >> while invoking elevator_exit or del_gensidk then ideally we don't need to >> acquire ->elevator_lock. The same is true for elevator_exit() called >> from add_disk_fwnode(). 
So IMO, we should update these paths to avoid >> acquiring ->elevator_lock. > > As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one > host wide event, so either lock or srcu sync is needed. Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel with del_gendisk or blk_unregister_queue. Thanks, --Nilay [ 399.112145] ====================================================== [ 399.112153] WARNING: possible circular locking dependency detected [ 399.112160] 6.15.0-rc1+ #159 Not tainted [ 399.112167] ------------------------------------------------------ [ 399.112174] bash/5891 is trying to acquire lock: [ 399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4 [ 399.112205] [ 399.112205] but task is already holding lock: [ 399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 [ 399.112231] [ 399.112231] which lock already depends on the new lock. [ 399.112231] [ 399.112239] [ 399.112239] the existing dependency chain (in reverse order) is: [ 399.112246] [ 399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}: [ 399.112260] blk_alloc_queue+0x3a8/0x3e4 [ 399.112268] blk_mq_alloc_queue+0x88/0x11c [ 399.112278] __blk_mq_alloc_disk+0x34/0xd8 [ 399.112290] null_add_dev+0x3c8/0x914 [null_blk] [ 399.112306] null_init+0x1e0/0x4bc [null_blk] [ 399.112319] do_one_initcall+0x8c/0x4b8 [ 399.112340] do_init_module+0x7c/0x2c4 [ 399.112396] init_module_from_file+0xb4/0x108 [ 399.112405] idempotent_init_module+0x26c/0x368 [ 399.112414] sys_finit_module+0x98/0x150 [ 399.112422] system_call_exception+0x134/0x360 [ 399.112430] system_call_vectored_common+0x15c/0x2ec [ 399.112441] [ 399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 399.112453] fs_reclaim_acquire+0xe4/0x120 [ 399.112461] kmem_cache_alloc_noprof+0x74/0x570 [ 399.112469] __kernfs_new_node+0x98/0x37c [ 399.112479] kernfs_new_node+0x94/0xe4 [ 399.112490] kernfs_create_dir_ns+0x44/0x120 [ 399.112501] sysfs_create_dir_ns+0x94/0x160 [ 399.112512] kobject_add_internal+0xf4/0x3c8 [ 399.112524] kobject_add+0x70/0x10c [ 399.112533] elv_register_queue+0x70/0x14c [ 399.112562] blk_register_queue+0x1d8/0x2bc [ 399.112575] add_disk_fwnode+0x3b4/0x5d0 [ 399.112588] sd_probe+0x3bc/0x5b4 [sd_mod] [ 399.112601] really_probe+0x104/0x4c4 [ 399.112613] __driver_probe_device+0xb8/0x200 [ 399.112623] driver_probe_device+0x54/0x128 [ 399.112632] __driver_attach_async_helper+0x7c/0x150 [ 399.112641] async_run_entry_fn+0x60/0x1bc [ 399.112665] process_one_work+0x2ac/0x7e4 [ 399.112678] worker_thread+0x238/0x460 [ 399.112691] kthread+0x158/0x188 [ 399.112700] start_kernel_thread+0x14/0x18 [ 399.112722] [ 399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}: [ 399.112737] __lock_acquire+0x1bec/0x2bc8 [ 399.112747] lock_acquire+0x140/0x430 [ 399.112760] __mutex_lock+0xf0/0xb00 [ 399.112769] elv_iosched_store+0x11c/0x5d4 [ 399.112781] queue_attr_store+0x12c/0x164 [ 399.112792] sysfs_kf_write+0x74/0xc4 [ 399.112804] kernfs_fop_write_iter+0x1a8/0x2a4 [ 399.112816] vfs_write+0x410/0x584 [ 399.112829] ksys_write+0x84/0x140 [ 399.112841] system_call_exception+0x134/0x360 [ 399.112851] system_call_vectored_common+0x15c/0x2ec [ 399.112863] [ 399.112863] other info that might help us debug this: [ 399.112863] [ 399.112874] Chain exists of: [ 399.112874] &q->elevator_lock --> fs_reclaim --> &q->q_usage_counter(io)#20 [ 399.112874] [ 399.112896] Possible unsafe locking scenario: [ 399.112896] [ 399.112905] CPU0 CPU1 [ 399.112928] ---- ---- [ 399.112936] 
lock(&q->q_usage_counter(io)#20); [ 399.112954] lock(fs_reclaim); [ 399.112967] lock(&q->q_usage_counter(io)#20); [ 399.112987] lock(&q->elevator_lock); [ 399.112998] [ 399.112998] *** DEADLOCK *** [ 399.112998] [ 399.113010] 5 locks held by bash/5891: [ 399.113017] #0: c000000009f87400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140 [ 399.113042] #1: c0000000097cfa88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x164/0x2a4 [ 399.113064] #2: c0000000ba4634f8 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x170/0x2a4 [ 399.113083] #3: c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 [ 399.113102] #4: c0000000bc233010 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 [ 399.113120] [ 399.113120] stack backtrace: [ 399.113126] CPU: 24 UID: 0 PID: 5891 Comm: bash Kdump: loaded Not tainted 6.15.0-rc1+ #159 VOLUNTARY [ 399.113130] Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries [ 399.113132] Call Trace: [ 399.113133] [c000000082bd7580] [c0000000011b69f8] dump_stack_lvl+0x108/0x18c (unreliable) [ 399.113141] [c000000082bd75b0] [c00000000022512c] print_circular_bug+0x448/0x604 [ 399.113146] [c000000082bd7660] [c000000000225534] check_noncircular+0x24c/0x26c [ 399.113150] [c000000082bd7730] [c00000000022b998] __lock_acquire+0x1bec/0x2bc8 [ 399.113155] [c000000082bd7860] [c000000000228d20] lock_acquire+0x140/0x430 [ 399.113159] [c000000082bd7960] [c0000000011f76e8] __mutex_lock+0xf0/0xb00 [ 399.113163] [c000000082bd7a90] [c00000000090a558] elv_iosched_store+0x11c/0x5d4 [ 399.113169] [c000000082bd7b50] [c000000000912d20] queue_attr_store+0x12c/0x164 [ 399.113174] [c000000082bd7c60] [c0000000007d9114] sysfs_kf_write+0x74/0xc4 [ 399.113179] [c000000082bd7ca0] [c0000000007d5b60] kernfs_fop_write_iter+0x1a8/0x2a4 [ 399.113185] [c000000082bd7cf0] [c0000000006b386c] vfs_write+0x410/0x584 [ 399.113190] [c000000082bd7dc0] [c0000000006b3d18] ksys_write+0x84/0x140 [ 399.113196] [c000000082bd7e10] [c000000000031814] system_call_exception+0x134/0x360 [ 399.113199] [c000000082bd7e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec [ 399.113205] --- interrupt: 3000 at 0x7fffb113b034 [ 399.113208] NIP: 00007fffb113b034 LR: 00007fffb113b034 CTR: 0000000000000000 [ 399.113211] REGS: c000000082bd7e80 TRAP: 3000 Not tainted (6.15.0-rc1+) [ 399.113213] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44422408 XER: 00000000 [ 399.113223] IRQMASK: 0 [ 399.113223] GPR00: 0000000000000004 00007fffe78cf020 00000001114d7e00 0000000000000001 [ 399.113223] GPR04: 000000014ce03ae0 000000000000000c 0000000000000010 0000000000000001 [ 399.113223] GPR08: 000000000000000b 0000000000000000 0000000000000000 0000000000000000 [ 399.113223] GPR12: 0000000000000000 00007fffb139ab60 000000014cef4ed0 00000001114d87b8 [ 399.113223] GPR16: 00000001114d94d8 0000000020000000 0000000000000000 00000001113e9070 [ 399.113223] GPR20: 000000011147beb8 00007fffe78cf1c4 000000011147f8a0 00000001114d89bc [ 399.113223] GPR24: 00000001114d8a50 0000000000000000 000000014ce03ae0 000000000000000c [ 399.113223] GPR28: 000000000000000c 00007fffb12418e0 000000014ce03ae0 000000000000000c [ 399.113256] NIP [00007fffb113b034] 0x7fffb113b034 [ 399.113258] LR [00007fffb113b034] 0x7fffb113b034 [ 399.113260] --- interrupt: 3000
On Thu, Apr 10, 2025 at 01:15:22AM +0530, Nilay Shroff wrote: > > > On 4/9/25 7:38 PM, Ming Lei wrote: > >>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues > >>>>> involved wrt. switching elevator. If elevator switching is not allowed > >>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set > >>>>> lock? > >>>>> > >>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. > >>>> However my question was if we were to not allow elevator switch while > >>>> __blk_mq_update_nr_hw_queues is running then how would we implement it? > >>> > >>> It can be done easily by tag_set->srcu. > >> Ok great if that's possible! But I'm not sure how it could be done in this > >> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store > >> run in the writer/updater context. So you may still need lock? Can you > >> please send across a (informal) patch with your idea ? > On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote: >> Yes new incoming NS may not be live yet when we iterate through >> ctrl->namespaces. So we don't need bother about it yet. If new NS needn't to be considered, the issue is easier to deal with updating nr_hw_queues, such as: - disable scheduler in 1st iterating tag_list - update nr_hw_queues in 2nd iterating tag_list - restore scheduler in 3rd iterating tag_list ->tag_list_lock is acquired & released in each iteration. Then per-set lock isn't needed any more. > > > > Please see the attached patch which treats elv_iosched_store() as reader. > > > I looked trough patch and looks good. However we still have an issue > in code paths where we acquire ->elevator_lock without first freezing > the queue.In this case if we try allocate memory using GFP_KERNEL > then fs_reclaim would still trigger lockdep splat. As for this case > the lock order would be, > > elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io) That is why I call ->elevator_lock is used too much, especially it is depended for dealing with kobject & sysfs, which isn't needed in this way. > > In fact, I tested your patch and got into the above splat. I've > attached lockdep splat for your reference. So IMO, we should instead > try make locking order such that ->freeze_lock shall never depend on > ->elevator_lock. It seems one way to make that possible if we make > ->elevator_lock per-set. The attached patch is just for avoiding race between add/del disk vs. updating nr_hw_queues, and it should have been for covering race between adding/exiting request queue vs. updating nr_hw_queues. If re-order can be done, that is definitely good, but... > > >>> > >>> Actually freeze lock is already held for nvme before calling > >>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue > >>> frozen for updating nr_hw_queues, so the above order may not match > >>> with the existed code. > >>> > >>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > >>> > >> I think we should consider (may be in different patch) updating > >> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove > >> its dependency on ->tag_list_lock. > > > > If we need to take nvme into account, the above lock order doesn't work, > > because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), > > and elevator lock still depends on freeze lock. > > > > If it needn't to be considered, per-set lock becomes necessary. 
> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues > we shall also update nvme pci, rdma and tcp drivers so that those > drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues > nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that > nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues. This way you cause new deadlock or new trouble if you reply on freeze in blk_mq_update_nr_hw_queues(): ->tag_list_lock freeze_lock If timeout or io failure happens during the above freeze_lock, timeout handler can not move on because new blk_mq_update_nr_hw_queues() can't grab the lock. Either deadlock or device has to been removed. > > >>>> > >>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall > >>>> help avoid few of the recent lockdep splats reported with fs_reclaim. > >>>> What do you think? > >>> > >>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim > >>> related splat. > >>> > >>> However, in del_gendisk(), freeze_lock is still held before calling > >>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder. > >> > >> Yes agreed, however elevator_exit() called from del_gendisk() or > >> elv_unregister_queue() called from blk_unregister_queue() are called > >> after we unregister the queue. And if queue has been already unregistered > >> while invoking elevator_exit or del_gensidk then ideally we don't need to > >> acquire ->elevator_lock. The same is true for elevator_exit() called > >> from add_disk_fwnode(). So IMO, we should update these paths to avoid > >> acquiring ->elevator_lock. > > > > As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one > > host wide event, so either lock or srcu sync is needed. > Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel > with del_gendisk or blk_unregister_queue. Then the per-set lock is required in both add/del disk, then how to re-order freeze_lock & elevator lock in del_gendisk()? And there is same risk with the one in blk_mq_update_nr_hw_queues(). > > Thanks, > --Nilay > [ 399.112145] ====================================================== > [ 399.112153] WARNING: possible circular locking dependency detected > [ 399.112160] 6.15.0-rc1+ #159 Not tainted > [ 399.112167] ------------------------------------------------------ > [ 399.112174] bash/5891 is trying to acquire lock: > [ 399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4 > [ 399.112205] > [ 399.112205] but task is already holding lock: > [ 399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 > [ 399.112231] > [ 399.112231] which lock already depends on the new lock. 
> [ 399.112231] > [ 399.112239] > [ 399.112239] the existing dependency chain (in reverse order) is: > [ 399.112246] > [ 399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}: > [ 399.112260] blk_alloc_queue+0x3a8/0x3e4 > [ 399.112268] blk_mq_alloc_queue+0x88/0x11c > [ 399.112278] __blk_mq_alloc_disk+0x34/0xd8 > [ 399.112290] null_add_dev+0x3c8/0x914 [null_blk] > [ 399.112306] null_init+0x1e0/0x4bc [null_blk] > [ 399.112319] do_one_initcall+0x8c/0x4b8 > [ 399.112340] do_init_module+0x7c/0x2c4 > [ 399.112396] init_module_from_file+0xb4/0x108 > [ 399.112405] idempotent_init_module+0x26c/0x368 > [ 399.112414] sys_finit_module+0x98/0x150 > [ 399.112422] system_call_exception+0x134/0x360 > [ 399.112430] system_call_vectored_common+0x15c/0x2ec > [ 399.112441] > [ 399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}: > [ 399.112453] fs_reclaim_acquire+0xe4/0x120 > [ 399.112461] kmem_cache_alloc_noprof+0x74/0x570 > [ 399.112469] __kernfs_new_node+0x98/0x37c > [ 399.112479] kernfs_new_node+0x94/0xe4 > [ 399.112490] kernfs_create_dir_ns+0x44/0x120 > [ 399.112501] sysfs_create_dir_ns+0x94/0x160 > [ 399.112512] kobject_add_internal+0xf4/0x3c8 > [ 399.112524] kobject_add+0x70/0x10c > [ 399.112533] elv_register_queue+0x70/0x14c > [ 399.112562] blk_register_queue+0x1d8/0x2bc > [ 399.112575] add_disk_fwnode+0x3b4/0x5d0 > [ 399.112588] sd_probe+0x3bc/0x5b4 [sd_mod] > [ 399.112601] really_probe+0x104/0x4c4 > [ 399.112613] __driver_probe_device+0xb8/0x200 > [ 399.112623] driver_probe_device+0x54/0x128 > [ 399.112632] __driver_attach_async_helper+0x7c/0x150 > [ 399.112641] async_run_entry_fn+0x60/0x1bc > [ 399.112665] process_one_work+0x2ac/0x7e4 > [ 399.112678] worker_thread+0x238/0x460 > [ 399.112691] kthread+0x158/0x188 > [ 399.112700] start_kernel_thread+0x14/0x18 > [ 399.112722] > [ 399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}: Another ways is to make sure that ->elevator_lock isn't required for dealing with kobject/debugfs thing, which needs to refactor elevator code. Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs actually need to drain reader, that is deadlock too. Thanks, Ming
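The three-pass tag_list iteration Ming sketches earlier in this message could look roughly like the snippet below. This is illustrative only, written as if it lived in block/blk-mq.c: the elevator helpers are placeholders for whatever a real series would use, and the scheme assumes no queue joins or leaves the set between passes (the "no new NS" condition discussed above), since ->tag_list_lock is dropped between them.

/* Illustrative sketch of the three-pass idea; not a posted implementation. */
static void update_nr_hw_queues_three_pass(struct blk_mq_tag_set *set,
                                           int nr_hw_queues)
{
        struct request_queue *q;

        /* pass 1: detach the schedulers, so no elevator tags outlive the resize */
        mutex_lock(&set->tag_list_lock);
        list_for_each_entry(q, &set->tag_list, tag_set_list)
                elevator_disable_for_update(q);         /* placeholder helper */
        mutex_unlock(&set->tag_list_lock);

        /* pass 2: the actual nr_hw_queues update */
        mutex_lock(&set->tag_list_lock);
        __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
        mutex_unlock(&set->tag_list_lock);

        /* pass 3: restore the previous schedulers */
        mutex_lock(&set->tag_list_lock);
        list_for_each_entry(q, &set->tag_list, tag_set_list)
                elevator_restore_after_update(q);       /* placeholder helper */
        mutex_unlock(&set->tag_list_lock);
}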
On 4/10/25 7:40 AM, Ming Lei wrote: > > If new NS needn't to be considered, the issue is easier to deal with updating > nr_hw_queues, such as: > > - disable scheduler in 1st iterating tag_list > > - update nr_hw_queues in 2nd iterating tag_list > > - restore scheduler in 3rd iterating tag_list > > ->tag_list_lock is acquired & released in each iteration. > > Then per-set lock isn't needed any more. > Hmm still thinking... >>> >>> Please see the attached patch which treats elv_iosched_store() as reader. >>> >> I looked trough patch and looks good. However we still have an issue >> in code paths where we acquire ->elevator_lock without first freezing >> the queue.In this case if we try allocate memory using GFP_KERNEL >> then fs_reclaim would still trigger lockdep splat. As for this case >> the lock order would be, >> >> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io) > > That is why I call ->elevator_lock is used too much, especially it is > depended for dealing with kobject & sysfs, which isn't needed in this way. > >> >> In fact, I tested your patch and got into the above splat. I've >> attached lockdep splat for your reference. So IMO, we should instead >> try make locking order such that ->freeze_lock shall never depend on >> ->elevator_lock. It seems one way to make that possible if we make >> ->elevator_lock per-set. > > The attached patch is just for avoiding race between add/del disk vs. > updating nr_hw_queues, and it should have been for covering race between > adding/exiting request queue vs. updating nr_hw_queues. > Okay understood. > If re-order can be done, that is definitely good, but... Yeah so I tried re-ordering locks so that we first grab q->elevator_lock and then ->freeze_lock. As we know there's a challenge with re-arranging the lock order in __blk_mq_update_nr_hw_queues, but I managed to rectify the lock order. I added one more tag_list iterator where we first acquire ->elevator lock and then in the next tag_list iterator we acquire ->freeze_lock. I have also updated this lock order at other call sites. Then as we have already discussed, there's also another challenge re-arranging the lock order in del_gendisk, however, I managed to mitigate that by moving elevator_exit and elv_unregister_queue (from blk_unregister_queue) just after we delete queue tag set (or in another words when remove the queue from the tag-list) in del_gendisk. So that means that we could now safely invoke elv_unregister_queue and elevator_exit from del_gendisk without needing to acquire ->elevator_lock. For reference, I attached a (informal) patch. Yes I know this patch would not fix all splats. But at-least we would stop observing splat related to ->frezze_lock dependency on ->elevator_lcok. > >> >>>>> >>>>> Actually freeze lock is already held for nvme before calling >>>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue >>>>> frozen for updating nr_hw_queues, so the above order may not match >>>>> with the existed code. >>>>> >>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now? >>>>> >>>> I think we should consider (may be in different patch) updating >>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove >>>> its dependency on ->tag_list_lock. >>> >>> If we need to take nvme into account, the above lock order doesn't work, >>> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), >>> and elevator lock still depends on freeze lock. >>> >>> If it needn't to be considered, per-set lock becomes necessary. 
>> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues >> we shall also update nvme pci, rdma and tcp drivers so that those >> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues >> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that >> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues. > > This way you cause new deadlock or new trouble if you reply on freeze in > blk_mq_update_nr_hw_queues(): > > ->tag_list_lock > freeze_lock > > If timeout or io failure happens during the above freeze_lock, timeout > handler can not move on because new blk_mq_update_nr_hw_queues() can't > grab the lock. > > Either deadlock or device has to been removed. > With this new attached patch do you still foresee this issue? Yes this patch doesn't change anything with nvme driver, but later we may update nvme driver so that nvme driver doesn't require freezing queue before invoking blk_mq_ update_nr_hw_queues. I think this is requires so that we follow the same lock order in all call paths wrt ->elevator_lock and ->freeze_lock. >>> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one >>> host wide event, so either lock or srcu sync is needed. >> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel >> with del_gendisk or blk_unregister_queue. > > Then the per-set lock is required in both add/del disk, then how to re-order > freeze_lock & elevator lock in del_gendisk()? > > And there is same risk with the one in blk_mq_update_nr_hw_queues(). > Yes please see above as I explained how we could potentially avoid lock order issue in del_gendisk. > > Another ways is to make sure that ->elevator_lock isn't required for dealing > with kobject/debugfs thing, which needs to refactor elevator code. > > Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs > actually need to drain reader, that is deadlock too. > I do agree. But I think lets first focus on cutting dependency of ->freeze_lock on ->elevator_lock. Once we get past it we may address other. This commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier") has now opened up pandora's box of lockdep splats :) Anyways it's good that we could now catch these issues early on. In general, I feel that now this change would show up early on the issues where ->freeze_lock depends on any other locks. Thanks, --Nilay
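For concreteness, the reordering Nilay describes would roughly invert the current order in a sysfs store path: take ->elevator_lock first, then freeze. The sketch below is illustrative only, written as if inside block/elevator.c; the attached informal patch is the authoritative version of the idea.

/* Illustrative only: ->elevator_lock taken before freezing the queue. */
static ssize_t store_sched_reordered(struct request_queue *q, const char *name,
                                     size_t count)
{
        unsigned int memflags;
        ssize_t ret;

        mutex_lock(&q->elevator_lock);
        memflags = blk_mq_freeze_queue(q);

        ret = elevator_change(q, name);
        if (!ret)
                ret = count;

        blk_mq_unfreeze_queue(q, memflags);
        mutex_unlock(&q->elevator_lock);
        return ret;
}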
On Thu, Apr 10, 2025 at 07:06:23PM +0530, Nilay Shroff wrote: > > > On 4/10/25 7:40 AM, Ming Lei wrote: > > > > If new NS needn't to be considered, the issue is easier to deal with updating > > nr_hw_queues, such as: > > > > - disable scheduler in 1st iterating tag_list > > > > - update nr_hw_queues in 2nd iterating tag_list > > > > - restore scheduler in 3rd iterating tag_list > > > > ->tag_list_lock is acquired & released in each iteration. > > > > Then per-set lock isn't needed any more. > > > Hmm still thinking... > >>> > >>> Please see the attached patch which treats elv_iosched_store() as reader. > >>> > >> I looked trough patch and looks good. However we still have an issue > >> in code paths where we acquire ->elevator_lock without first freezing > >> the queue.In this case if we try allocate memory using GFP_KERNEL > >> then fs_reclaim would still trigger lockdep splat. As for this case > >> the lock order would be, > >> > >> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io) > > > > That is why I call ->elevator_lock is used too much, especially it is > > depended for dealing with kobject & sysfs, which isn't needed in this way. > > > >> > >> In fact, I tested your patch and got into the above splat. I've > >> attached lockdep splat for your reference. So IMO, we should instead > >> try make locking order such that ->freeze_lock shall never depend on > >> ->elevator_lock. It seems one way to make that possible if we make > >> ->elevator_lock per-set. > > > > The attached patch is just for avoiding race between add/del disk vs. > > updating nr_hw_queues, and it should have been for covering race between > > adding/exiting request queue vs. updating nr_hw_queues. > > > Okay understood. > > > If re-order can be done, that is definitely good, but... > Yeah so I tried re-ordering locks so that we first grab q->elevator_lock > and then ->freeze_lock. As we know there's a challenge with re-arranging > the lock order in __blk_mq_update_nr_hw_queues, but I managed to rectify > the lock order. I added one more tag_list iterator where we first acquire > ->elevator lock and then in the next tag_list iterator we acquire > ->freeze_lock. I have also updated this lock order at other call sites. > > Then as we have already discussed, there's also another challenge re-arranging > the lock order in del_gendisk, however, I managed to mitigate that by moving > elevator_exit and elv_unregister_queue (from blk_unregister_queue) just after > we delete queue tag set (or in another words when remove the queue from the > tag-list) in del_gendisk. So that means that we could now safely invoke > elv_unregister_queue and elevator_exit from del_gendisk without needing to > acquire ->elevator_lock. > > For reference, I attached a (informal) patch. Yes I know this patch would not > fix all splats. But at-least we would stop observing splat related to > ->frezze_lock dependency on ->elevator_lcok. I just sent out the whole patchset, which did one thing basically: move kobject & debugfs & cpuhp out of queue freezing, then no any lockdep is observed in my test VM after running blktests of './check block/'. So the point is _not_ related with elevator lock or its order with freeze lock. What matters is actually "do not connect freeze lock with other subsystem(debugfs, sysfs, cpuhp, ...)", because freeze_lock relies on fs_reclaim directly with commit ffa1e7ada456, but other subsystem easily depends on fs_reclaim again. I did have patches to follow your idea to reorder elevator lock vs. 
freeze lock, but it didn't make difference, even though after I killed most of elevator_lock. Finally I realized the above point. > > > > >> > >>>>> > >>>>> Actually freeze lock is already held for nvme before calling > >>>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue > >>>>> frozen for updating nr_hw_queues, so the above order may not match > >>>>> with the existed code. > >>>>> > >>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now? > >>>>> > >>>> I think we should consider (may be in different patch) updating > >>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove > >>>> its dependency on ->tag_list_lock. > >>> > >>> If we need to take nvme into account, the above lock order doesn't work, > >>> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(), > >>> and elevator lock still depends on freeze lock. > >>> > >>> If it needn't to be considered, per-set lock becomes necessary. > >> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues > >> we shall also update nvme pci, rdma and tcp drivers so that those > >> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues > >> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that > >> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues. > > > > This way you cause new deadlock or new trouble if you reply on freeze in > > blk_mq_update_nr_hw_queues(): > > > > ->tag_list_lock > > freeze_lock > > > > If timeout or io failure happens during the above freeze_lock, timeout > > handler can not move on because new blk_mq_update_nr_hw_queues() can't > > grab the lock. > > > > Either deadlock or device has to been removed. > > > With this new attached patch do you still foresee this issue? Yes this patch > doesn't change anything with nvme driver, but later we may update nvme driver > so that nvme driver doesn't require freezing queue before invoking blk_mq_ > update_nr_hw_queues. I think this is requires so that we follow the same lock > order in all call paths wrt ->elevator_lock and ->freeze_lock. nvme freeze lock is non_owner, and it can't be verified now, so don't use nvme for running this lock test. > > >>> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one > >>> host wide event, so either lock or srcu sync is needed. > >> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel > >> with del_gendisk or blk_unregister_queue. > > > > Then the per-set lock is required in both add/del disk, then how to re-order > > freeze_lock & elevator lock in del_gendisk()? > > > > And there is same risk with the one in blk_mq_update_nr_hw_queues(). > > > Yes please see above as I explained how we could potentially avoid lock order > issue in del_gendisk. > > > > Another ways is to make sure that ->elevator_lock isn't required for dealing > > with kobject/debugfs thing, which needs to refactor elevator code. > > > > Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs > > actually need to drain reader, that is deadlock too. > > > I do agree. But I think lets first focus on cutting dependency of ->freeze_lock > on ->elevator_lock. Once we get past it we may address other. > > This commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up > earlier") has now opened up pandora's box of lockdep splats :) > Anyways it's good that we could now catch these issues early on. 
In general, > I feel that now this change would show up early on the issues where ->freeze_lock > depends on any other locks. It also shows that the block layer queue freeze API should be used very carefully; fortunately we have the powerful lockdep. Thanks, Ming
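Ming's conclusion, restated as a sketch: whatever runs under the freeze should not be the part that talks to kernfs/debugfs/cpuhp, because those paths allocate with GFP_KERNEL and take their own locks, which re-introduces the fs_reclaim edge. Assuming a registration helper of roughly this shape (illustrative only; the posted patchset is the real reference):

/* Illustrative ordering only. */
static int register_disk_sketch(struct request_queue *q, struct kobject *parent)
{
        unsigned int memflags;
        int ret;

        /* kobject/sysfs work first, outside any frozen section */
        ret = kobject_add(&q->elevator->kobj, parent, "iosched");
        if (ret)
                return ret;

        /* freeze only around the state change that needs a quiesced queue */
        memflags = blk_mq_freeze_queue(q);
        /* ... flip queue state ... */
        blk_mq_unfreeze_queue(q, memflags);

        return 0;
}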
On 4/10/25 7:53 PM, Ming Lei wrote: > > I just sent out the whole patchset, which did one thing basically: move kobject > & debugfs & cpuhp out of queue freezing, then no any lockdep is observed in my > test VM after running blktests of './check block/'. Sure, I think our emails crossed over. I will review your changes. > > So the point is _not_ related with elevator lock or its order with > freeze lock. What matters is actually "do not connect freeze lock with > other subsystem(debugfs, sysfs, cpuhp, ...)", because freeze_lock relies > on fs_reclaim directly with commit ffa1e7ada456, but other subsystem easily > depends on fs_reclaim again. > > I did have patches to follow your idea to reorder elevator lock vs. freeze > lock, but it didn't make difference, even though after I killed most of > elevator_lock. > > Finally I realized the above point. Ok great:) Thanks, --Nilay
diff --git a/block/blk-mq.c b/block/blk-mq.c index ae8494d88897..d7a103dc258b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4465,14 +4465,12 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx( return NULL; } -static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, - struct request_queue *q) +static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, + struct request_queue *q) { struct blk_mq_hw_ctx *hctx; unsigned long i, j; - /* protect against switching io scheduler */ - mutex_lock(&q->elevator_lock); for (i = 0; i < set->nr_hw_queues; i++) { int old_node; int node = blk_mq_get_hctx_node(set, i); @@ -4505,7 +4503,19 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, xa_for_each_start(&q->hctx_table, j, hctx, j) blk_mq_exit_hctx(q, set, hctx, j); - mutex_unlock(&q->elevator_lock); +} + +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, + struct request_queue *q, bool lock) +{ + if (lock) { + /* protect against switching io scheduler */ + mutex_lock(&q->elevator_lock); + __blk_mq_realloc_hw_ctxs(set, q); + mutex_unlock(&q->elevator_lock); + } else { + __blk_mq_realloc_hw_ctxs(set, q); + } /* unregister cpuhp callbacks for exited hctxs */ blk_mq_remove_hw_queues_cpuhp(q); @@ -4537,7 +4547,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, xa_init(&q->hctx_table); - blk_mq_realloc_hw_ctxs(set, q); + blk_mq_realloc_hw_ctxs(set, q, false); if (!q->nr_hw_queues) goto err_hctxs; @@ -5033,7 +5043,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, fallback: blk_mq_update_queue_map(set); list_for_each_entry(q, &set->tag_list, tag_set_list) { - blk_mq_realloc_hw_ctxs(set, q); + blk_mq_realloc_hw_ctxs(set, q, true); if (q->nr_hw_queues != set->nr_hw_queues) { int i = prev_nr_hw_queues;
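The design point of the informal patch above, spelled out: during queue initialization no elevator switch can run yet, so ->elevator_lock can be skipped, while the nr_hw_queues update path still needs it. The two call sites in the diff reduce to:

/* queue initialization: elevator switching cannot run yet */
blk_mq_realloc_hw_ctxs(set, q, false);

/* __blk_mq_update_nr_hw_queues(): elevator switching must be excluded */
blk_mq_realloc_hw_ctxs(set, q, true);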
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. queue freeze lock depends on fs_reclaim. So don't grab elevator lock during queue initialization which needs to call kmalloc(GFP_KERNEL), and we can cut the dependency between ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. This way is safe because elevator setting isn't ready to run during queue initialization. There isn't such issue in __blk_mq_update_nr_hw_queues() because memalloc_noio_save() is called before acquiring elevator lock. Fixes the following lockdep warning: https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> --- block/blk-mq.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-)
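The memalloc_noio_save() pattern referred to in the last paragraph looks roughly like the sketch below (illustrative, not the actual __blk_mq_update_nr_hw_queues() body): while the NOIO flag is set on the task, GFP_KERNEL allocations implicitly drop the IO/FS reclaim flags, so they cannot recurse into filesystem or IO reclaim and no elevator_lock -> fs_reclaim dependency is recorded by lockdep.

#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Illustrative sketch of the memalloc_noio_save() pattern. */
static void update_under_elevator_lock(struct request_queue *q)
{
        unsigned int noio_flags;
        void *buf;

        noio_flags = memalloc_noio_save();      /* GFP_KERNEL now behaves like GFP_NOIO */
        mutex_lock(&q->elevator_lock);

        buf = kmalloc(64, GFP_KERNEL);          /* cannot enter fs/IO reclaim here */
        kfree(buf);

        mutex_unlock(&q->elevator_lock);
        memalloc_noio_restore(noio_flags);
}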