
block: don't grab elevator lock during queue initialization

Message ID 20250403105402.1334206-1-ming.lei@redhat.com (mailing list archive)
State New
Series block: don't grab elevator lock during queue initialization

Commit Message

Ming Lei April 3, 2025, 10:54 a.m. UTC
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.

queue freeze lock depends on fs_reclaim.

So don't grab the elevator lock during queue initialization, which needs to
call kmalloc(GFP_KERNEL). This cuts the dependency between ->elevator_lock
and fs_reclaim, and the lockdep warning can be killed.

This is safe because elevator switching isn't ready to run during queue
initialization.

There isn't such an issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring the elevator lock.
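
For reference, that pattern looks roughly like this (an illustrative sketch,
not the exact mainline code; the wrapper name is made up):

	static void update_nr_hw_queues_sketch(struct request_queue *q)
	{
		unsigned int memflags = memalloc_noio_save();

		/*
		 * Allocations below are implicitly degraded to GFP_NOIO, so
		 * lockdep's fs_reclaim dependency is never taken while
		 * ->elevator_lock is held.
		 */
		mutex_lock(&q->elevator_lock);
		/* reallocate hctxs, remap queues, restore the elevator, ... */
		mutex_unlock(&q->elevator_lock);

		memalloc_noio_restore(memflags);
	}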

Fixes the following lockdep warning:

https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/

Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

Comments

Nilay Shroff April 3, 2025, 1:19 p.m. UTC | #1
On 4/3/25 4:24 PM, Ming Lei wrote:
> ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
> 
> queue freeze lock depends on fs_reclaim.
> 
> So don't grab elevator lock during queue initialization which needs to
> call kmalloc(GFP_KERNEL), and we can cut the dependency between
> ->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
> 
> This way is safe because elevator setting isn't ready to run during
> queue initialization.
> 
> There isn't such issue in __blk_mq_update_nr_hw_queues() because
> memalloc_noio_save() is called before acquiring elevator lock.
> 
> Fixes the following lockdep warning:
> 
> https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
> 
> Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
> Cc: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
I think you earlier posted this same patch here:
https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/

I tested that earlier patch and got into another lockdep splat as reported here: 
https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/

I don't know why we think your earlier fix, which cut the dependency between 
->elevator_lock and ->freeze_lock, doesn't work. But anyway, my view
is that we've got into these lock chains from two different code paths:

path1: elevator_lock 
         -> fs_reclaim (GFP_KERNEL)
           -> freeze_lock

path2: freeze_lock(memalloc_noio) 
         -> elevator_lock 
           -> fs_reclaim (this becomes NOP in this case due to memalloc_noio)

As you can see above, we've got a circular dependency between 
->elevator_lock and ->freeze_lock. I thought with your earlier patch
we were able to cut that dependency using blk_mq_enter_no_io and 
blk_mq_exit_no_io, but I'm not sure why we think that won't work?

Thanks,
--Nilay
Ming Lei April 3, 2025, 2:24 p.m. UTC | #2
On Thu, Apr 03, 2025 at 06:49:05PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/3/25 4:24 PM, Ming Lei wrote:
> > ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
> > 
> > queue freeze lock depends on fs_reclaim.
> > 
> > So don't grab elevator lock during queue initialization which needs to
> > call kmalloc(GFP_KERNEL), and we can cut the dependency between
> > ->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
> > 
> > This way is safe because elevator setting isn't ready to run during
> > queue initialization.
> > 
> > There isn't such issue in __blk_mq_update_nr_hw_queues() because
> > memalloc_noio_save() is called before acquiring elevator lock.
> > 
> > Fixes the following lockdep warning:
> > 
> > https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
> > 
> > Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
> > Cc: Nilay Shroff <nilay@linux.ibm.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 24 +++++++++++++++++-------
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> I think you earlier posted this same patch here:
> https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/
> 
> I tested that earlier patch and got into another lockdep splat as reported here: 
> https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/

That is a different one; let's fix this one first.

The ->elevator_lock in blk_register_queue() should be for avoiding a race
with updating nr_hw_queues, right?

That is why I think the dependency between the elevator lock and the freeze
lock is one source of trouble.

> 
> I don't know why we think your earlier fix which cut dependency between 
> ->elevator_lock and ->freeze_lock doesn't work. But anyways, my view
> is that we've got into these lock chains from two different code paths:

As I explained, blk_mq_enter_no_io() is the same as freezing the queue; it is
just that the lock in __bio_queue_enter() isn't modeled for lockdep. If that
is done, every lockdep warning will be re-triggered too.

> 
> path1: elevator_lock 
>          -> fs_reclaim (GFP_KERNEL)
>            -> freeze_lock
> 
> path2: freeze_lock(memalloc_noio) 
>          -> elevator_lock 
>            -> fs_reclaim (this becomes NOP in this case due to memalloc_noio)

No, there is no fs_reclaim in path2; memalloc_noio() avoids it.


Thanks,
Ming
Jens Axboe April 3, 2025, 2:32 p.m. UTC | #3
On Thu, 03 Apr 2025 18:54:02 +0800, Ming Lei wrote:
> ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
> 
> queue freeze lock depends on fs_reclaim.
> 
> So don't grab elevator lock during queue initialization which needs to
> call kmalloc(GFP_KERNEL), and we can cut the dependency between
> ->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
> 
> [...]

Applied, thanks!

[1/1] block: don't grab elevator lock during queue initialization
      commit: 01b91bf14f6d4893e03e357006e7af3a20c03fee

Best regards,
Christoph Hellwig April 4, 2025, 9:10 a.m. UTC | #4
On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> Fixes the following lockdep warning:

Please spell the actual dependency out here, links are not permanent
and also not readable for any offline reading of the commit logs.

> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> +				   struct request_queue *q, bool lock)
> +{
> +	if (lock) {

bool lock(ed) arguments are an anti-pattern, and regularly get Linus
screaming at you (in this case even for the right reason :))

> +		/* protect against switching io scheduler  */
> +		mutex_lock(&q->elevator_lock);
> +		__blk_mq_realloc_hw_ctxs(set, q);
> +		mutex_unlock(&q->elevator_lock);
> +	} else {
> +		__blk_mq_realloc_hw_ctxs(set, q);
> +	}

I think the problem here is again that, because of all the other
dependencies, elevator_lock really needs to be per-set instead of
per-queue, which will allow us to have much saner locking hierarchies.
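
For illustration, the usual way to avoid a bool "lock" argument is a separate
_locked wrapper, so each caller's locking stays explicit (a sketch only: the
wrapper name is hypothetical, and __blk_mq_realloc_hw_ctxs() is the helper
from the patch):

	static void blk_mq_realloc_hw_ctxs_locked(struct blk_mq_tag_set *set,
						  struct request_queue *q)
	{
		/* protect against switching io scheduler */
		mutex_lock(&q->elevator_lock);
		__blk_mq_realloc_hw_ctxs(set, q);
		mutex_unlock(&q->elevator_lock);
	}

	/*
	 * Callers that must not (or need not) take ->elevator_lock keep
	 * calling __blk_mq_realloc_hw_ctxs() directly.
	 */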
Ming Lei April 4, 2025, 12:09 p.m. UTC | #5
On Fri, Apr 04, 2025 at 11:10:37AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> > Fixes the following lockdep warning:
> 
> Please spell the actual dependency out here, links are not permanent
> and also not readable for any offline reading of the commit logs.
> 
> > +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > +				   struct request_queue *q, bool lock)
> > +{
> > +	if (lock) {
> 
> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> screaming at you (in this case even for the right reason :))
> 
> > +		/* protect against switching io scheduler  */
> > +		mutex_lock(&q->elevator_lock);
> > +		__blk_mq_realloc_hw_ctxs(set, q);
> > +		mutex_unlock(&q->elevator_lock);
> > +	} else {
> > +		__blk_mq_realloc_hw_ctxs(set, q);
> > +	}
> 
> I think the problem here is again that because of all the other
> dependencies elevator_lock really needs to be per-set instead of
> per-queue which will allows us to have much saner locking hierarchies.

A per-set lock is required in the error handling code path, and anywhere it is
used together with queue freezing becomes a deadlock risk if a hardware error
happens.

Or can you explain the idea in details?

Thanks,
Ming
Nilay Shroff April 5, 2025, 2 p.m. UTC | #6
>> I don't know why we think your earlier fix which cut dependency between 
>> ->elevator_lock and ->freeze_lock doesn't work. But anyways, my view
>> is that we've got into these lock chains from two different code paths:
> 
> As I explained, blk_mq_enter_no_io() is same with freeze queue, just the
> lock in __bio_queue_enter() isn't modeled. If it is done, every lockdep
> warning will be re-triggered too.
> 
Oh I see; because I tested your earlier patches without the lock modeled, I 
didn't encounter any lockdep warning.

>>
>> path1: elevator_lock 
>>          -> fs_reclaim (GFP_KERNEL)
>>            -> freeze_lock
>>
>> path2: freeze_lock(memalloc_noio) 
>>          -> elevator_lock 
>>            -> fs_reclaim (this becomes NOP in this case due to memalloc_noio)
> 
> No, there isn't fs_reclaim in path2, and memalloc_noio() will avoid it.
> 
Yes, correct, and that is why I mentioned above that fs_reclaim becomes a NOP
in path2.

Thanks,
--Nilay
Nilay Shroff April 5, 2025, 2:14 p.m. UTC | #7
On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
>> Fixes the following lockdep warning:
> 
> Please spell the actual dependency out here, links are not permanent
> and also not readable for any offline reading of the commit logs.
> 
>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>> +				   struct request_queue *q, bool lock)
>> +{
>> +	if (lock) {
> 
> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> screaming at you (in this case even for the right reason :))
> 
>> +		/* protect against switching io scheduler  */
>> +		mutex_lock(&q->elevator_lock);
>> +		__blk_mq_realloc_hw_ctxs(set, q);
>> +		mutex_unlock(&q->elevator_lock);
>> +	} else {
>> +		__blk_mq_realloc_hw_ctxs(set, q);
>> +	}
> 
> I think the problem here is again that because of all the other
> dependencies elevator_lock really needs to be per-set instead of
> per-queue which will allows us to have much saner locking hierarchies.
> 
I believe you meant here q->tag_set->elevator_lock? 
If yes then it means that we should be able to grab ->elevator_lock
before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
order should be in each code path,

__blk_mq_update_nr_hw_queues
    ->elevator_lock 
      ->freeze_lock
or 
blk_register_queue
    ->elevator_lock 
      -> fs_reclaim (GFP_KERNEL)
       -> freeze_lock

Other code paths using ->elevator_lock and ->freeze_lock shall be 
updated accordingly.

Thanks,
--Nilay
Ming Lei April 7, 2025, 3:09 a.m. UTC | #8
On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> > On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> >> Fixes the following lockdep warning:
> > 
> > Please spell the actual dependency out here, links are not permanent
> > and also not readable for any offline reading of the commit logs.
> > 
> >> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >> +				   struct request_queue *q, bool lock)
> >> +{
> >> +	if (lock) {
> > 
> > bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> > screaming at you (in this case even for the right reason :))
> > 
> >> +		/* protect against switching io scheduler  */
> >> +		mutex_lock(&q->elevator_lock);
> >> +		__blk_mq_realloc_hw_ctxs(set, q);
> >> +		mutex_unlock(&q->elevator_lock);
> >> +	} else {
> >> +		__blk_mq_realloc_hw_ctxs(set, q);
> >> +	}
> > 
> > I think the problem here is again that because of all the other
> > dependencies elevator_lock really needs to be per-set instead of
> > per-queue which will allows us to have much saner locking hierarchies.
> > 
> I believe you meant here q->tag_set->elevator_lock? 

I don't know what locks you are planning to invent.

For set->tag_list_lock, it has been very fragile:

blk_mq_update_nr_hw_queues
	set->tag_list_lock
		freeze_queue

If an IO failure happens while waiting in the above freeze_queue(), nvme error
handling can't make forward progress any more, because the error handling code
path requires set->tag_list_lock.

So all queues should be frozen before calling blk_mq_update_nr_hw_queues();
fortunately that is what nvme is doing.
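
Roughly, the ordering nvme follows looks like this (a simplified sketch of the
reset path, not the exact driver code):

	nvme_start_freeze(ctrl);	/* mark all namespace queues frozen */
	nvme_wait_freeze(ctrl);		/* wait for in-flight I/O to drain */
	blk_mq_update_nr_hw_queues(ctrl->tagset, nr_hw_queues);
	nvme_unfreeze(ctrl);		/* release the freeze */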


> If yes then it means that we should be able to grab ->elevator_lock
> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> order should be in each code path,
> 
> __blk_mq_update_nr_hw_queues
>     ->elevator_lock 
>       ->freeze_lock

Now tagset->elevator_lock would depend on set->tag_list_lock, and that just
makes things worse. Why can't we disable elevator switching while
nr_hw_queues is being updated?



Thanks,
Ming
Christoph Hellwig April 7, 2025, 6:49 a.m. UTC | #9
On Fri, Apr 04, 2025 at 08:09:31PM +0800, Ming Lei wrote:
> > I think the problem here is again that because of all the other
> > dependencies elevator_lock really needs to be per-set instead of
> > per-queue which will allows us to have much saner locking hierarchies.
> 
> per-set lock is required in error handling code path, anywhere it is used
> with freezing together can be one deadlock risk if hardware error happens.
> 
> Or can you explain the idea in details?

I fail to parse the above sentence.  Why would a per-set lock that is always
taken before freezing queues have an inherent deadlock risk?
Nilay Shroff April 7, 2025, 8:29 a.m. UTC | #10
On 4/7/25 8:39 AM, Ming Lei wrote:
> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
>>>> Fixes the following lockdep warning:
>>>
>>> Please spell the actual dependency out here, links are not permanent
>>> and also not readable for any offline reading of the commit logs.
>>>
>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>>>> +				   struct request_queue *q, bool lock)
>>>> +{
>>>> +	if (lock) {
>>>
>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
>>> screaming at you (in this case even for the right reason :))
>>>
>>>> +		/* protect against switching io scheduler  */
>>>> +		mutex_lock(&q->elevator_lock);
>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>> +		mutex_unlock(&q->elevator_lock);
>>>> +	} else {
>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>> +	}
>>>
>>> I think the problem here is again that because of all the other
>>> dependencies elevator_lock really needs to be per-set instead of
>>> per-queue which will allows us to have much saner locking hierarchies.
>>>
>> I believe you meant here q->tag_set->elevator_lock? 
> 
> I don't know what locks you are planning to invent.
> 
> For set->tag_list_lock, it has been very fragile:
> 
> blk_mq_update_nr_hw_queues
> 	set->tag_list_lock
> 		freeze_queue
> 
> If IO failure happens when waiting in above freeze_queue(), the nvme error
> handling can't provide forward progress any more, because the error
> handling code path requires set->tag_list_lock.

I think you're referring here to nvme_quiesce_io_queues and nvme_unquiesce_io_queues,
which are called in the nvme error handling path. If yes, then I believe these
functions could easily be modified so that they don't require ->tag_list_lock. 

> 
> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> fortunately that is what nvme is doing.
> 
> 
>> If yes then it means that we should be able to grab ->elevator_lock
>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
>> order should be in each code path,
>>
>> __blk_mq_update_nr_hw_queues
>>     ->elevator_lock 
>>       ->freeze_lock
> 
> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> just make things worse. Why can't we disable elevator switch during
> updating nr_hw_queues?
> 
I couldn't quite understand this. We already disable the elevator first, before
updating the sw-to-hw queue mapping in __blk_mq_update_nr_hw_queues(); once the
mapping is successful we switch the elevator back.

Thanks,
--Nilay
Ming Lei April 8, 2025, 7:38 a.m. UTC | #11
On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/7/25 8:39 AM, Ming Lei wrote:
> > On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> >>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> >>>> Fixes the following lockdep warning:
> >>>
> >>> Please spell the actual dependency out here, links are not permanent
> >>> and also not readable for any offline reading of the commit logs.
> >>>
> >>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >>>> +				   struct request_queue *q, bool lock)
> >>>> +{
> >>>> +	if (lock) {
> >>>
> >>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> >>> screaming at you (in this case even for the right reason :))
> >>>
> >>>> +		/* protect against switching io scheduler  */
> >>>> +		mutex_lock(&q->elevator_lock);
> >>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>> +		mutex_unlock(&q->elevator_lock);
> >>>> +	} else {
> >>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>> +	}
> >>>
> >>> I think the problem here is again that because of all the other
> >>> dependencies elevator_lock really needs to be per-set instead of
> >>> per-queue which will allows us to have much saner locking hierarchies.
> >>>
> >> I believe you meant here q->tag_set->elevator_lock? 
> > 
> > I don't know what locks you are planning to invent.
> > 
> > For set->tag_list_lock, it has been very fragile:
> > 
> > blk_mq_update_nr_hw_queues
> > 	set->tag_list_lock
> > 		freeze_queue
> > 
> > If IO failure happens when waiting in above freeze_queue(), the nvme error
> > handling can't provide forward progress any more, because the error
> > handling code path requires set->tag_list_lock.
> 
> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues

Yes.

> which is called in nvme error handling path. If yes then I believe this function 
> could be easily modified so that it doesn't require ->tag_list_lock. 

Not sure it is that easy; ->tag_list_lock is exactly for protecting the
"set->tag_list" list.

And the same list is iterated in blk_mq_update_nr_hw_queues() too.

> 
> > 
> > So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> > fortunately that is what nvme is doing.
> > 
> > 
> >> If yes then it means that we should be able to grab ->elevator_lock
> >> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> >> order should be in each code path,
> >>
> >> __blk_mq_update_nr_hw_queues
> >>     ->elevator_lock 
> >>       ->freeze_lock
> > 
> > Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> > just make things worse. Why can't we disable elevator switch during
> > updating nr_hw_queues?
> > 
> I couldn't quite understand this. As we already first disable the elevator
> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> Once mapping is successful we switch back the elevator.

Yes, but the user may still switch the elevator from none to another one during
that period, right?


thanks,
Ming
Nilay Shroff April 8, 2025, 1:25 p.m. UTC | #12
On 4/8/25 1:08 PM, Ming Lei wrote:
> On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/7/25 8:39 AM, Ming Lei wrote:
>>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
>>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
>>>>>> Fixes the following lockdep warning:
>>>>>
>>>>> Please spell the actual dependency out here, links are not permanent
>>>>> and also not readable for any offline reading of the commit logs.
>>>>>
>>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>>>>>> +				   struct request_queue *q, bool lock)
>>>>>> +{
>>>>>> +	if (lock) {
>>>>>
>>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
>>>>> screaming at you (in this case even for the right reason :))
>>>>>
>>>>>> +		/* protect against switching io scheduler  */
>>>>>> +		mutex_lock(&q->elevator_lock);
>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>>>> +		mutex_unlock(&q->elevator_lock);
>>>>>> +	} else {
>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>>>> +	}
>>>>>
>>>>> I think the problem here is again that because of all the other
>>>>> dependencies elevator_lock really needs to be per-set instead of
>>>>> per-queue which will allows us to have much saner locking hierarchies.
>>>>>
>>>> I believe you meant here q->tag_set->elevator_lock? 
>>>
>>> I don't know what locks you are planning to invent.
>>>
>>> For set->tag_list_lock, it has been very fragile:
>>>
>>> blk_mq_update_nr_hw_queues
>>> 	set->tag_list_lock
>>> 		freeze_queue
>>>
>>> If IO failure happens when waiting in above freeze_queue(), the nvme error
>>> handling can't provide forward progress any more, because the error
>>> handling code path requires set->tag_list_lock.
>>
>> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> 
> Yes.
> 
>> which is called in nvme error handling path. If yes then I believe this function 
>> could be easily modified so that it doesn't require ->tag_list_lock. 
> 
> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
> 
Please see this; here nvme_quiesce_io_queues doesn't require ->tag_list_lock:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 777db89fdaa7..002d2fd20e0c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
 {
        if (!ctrl->tagset)
                return;
-       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
-               blk_mq_quiesce_tagset(ctrl->tagset);
-       else
-               blk_mq_wait_quiesce_done(ctrl->tagset);
+       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
+               struct nvme_ns *ns;
+               int srcu_idx;
+
+               srcu_idx = srcu_read_lock(&ctrl->srcu);
+               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
+                               srcu_read_lock_held(&ctrl->srcu)) {
+                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
+                               blk_mq_quiesce_queue_nowait(ns->queue);
+               }
+               srcu_read_unlock(&ctrl->srcu, srcu_idx);
+       }
+       blk_mq_wait_quiesce_done(ctrl->tagset);
 }
 EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);

Here we iterate through ctrl->namespaces instead of relying on tag_list
and so we don't need to acquire ->tag_list_lock.

> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> 
>>
>>>
>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
>>> fortunately that is what nvme is doing.
>>>
>>>
>>>> If yes then it means that we should be able to grab ->elevator_lock
>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
>>>> order should be in each code path,
>>>>
>>>> __blk_mq_update_nr_hw_queues
>>>>     ->elevator_lock 
>>>>       ->freeze_lock
>>>
>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
>>> just make things worse. Why can't we disable elevator switch during
>>> updating nr_hw_queues?
>>>
>> I couldn't quite understand this. As we already first disable the elevator
>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
>> Once mapping is successful we switch back the elevator.
> 
> Yes, but user still may switch elevator from none to others during the
> period, right?
> 
Yes correct, that's possible. So your suggestion was to disable elevator updates
while __blk_mq_update_nr_hw_queues is running? That way the user couldn't update
the elevator through sysfs (elv_iosched_store) while we update nr_hw_queues? If
so, how would that help solve the lockdep splat?

Thanks,
--Nilay
Ming Lei April 8, 2025, 1:50 p.m. UTC | #13
On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/8/25 1:08 PM, Ming Lei wrote:
> > On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/7/25 8:39 AM, Ming Lei wrote:
> >>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
> >>>>
> >>>>
> >>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> >>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> >>>>>> Fixes the following lockdep warning:
> >>>>>
> >>>>> Please spell the actual dependency out here, links are not permanent
> >>>>> and also not readable for any offline reading of the commit logs.
> >>>>>
> >>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >>>>>> +				   struct request_queue *q, bool lock)
> >>>>>> +{
> >>>>>> +	if (lock) {
> >>>>>
> >>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> >>>>> screaming at you (in this case even for the right reason :))
> >>>>>
> >>>>>> +		/* protect against switching io scheduler  */
> >>>>>> +		mutex_lock(&q->elevator_lock);
> >>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>>>> +		mutex_unlock(&q->elevator_lock);
> >>>>>> +	} else {
> >>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>>>> +	}
> >>>>>
> >>>>> I think the problem here is again that because of all the other
> >>>>> dependencies elevator_lock really needs to be per-set instead of
> >>>>> per-queue which will allows us to have much saner locking hierarchies.
> >>>>>
> >>>> I believe you meant here q->tag_set->elevator_lock? 
> >>>
> >>> I don't know what locks you are planning to invent.
> >>>
> >>> For set->tag_list_lock, it has been very fragile:
> >>>
> >>> blk_mq_update_nr_hw_queues
> >>> 	set->tag_list_lock
> >>> 		freeze_queue
> >>>
> >>> If IO failure happens when waiting in above freeze_queue(), the nvme error
> >>> handling can't provide forward progress any more, because the error
> >>> handling code path requires set->tag_list_lock.
> >>
> >> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> > 
> > Yes.
> > 
> >> which is called in nvme error handling path. If yes then I believe this function 
> >> could be easily modified so that it doesn't require ->tag_list_lock. 
> > 
> > Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
> > 
> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 777db89fdaa7..002d2fd20e0c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
>  {
>         if (!ctrl->tagset)
>                 return;
> -       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> -               blk_mq_quiesce_tagset(ctrl->tagset);
> -       else
> -               blk_mq_wait_quiesce_done(ctrl->tagset);
> +       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> +               struct nvme_ns *ns;
> +               int srcu_idx;
> +
> +               srcu_idx = srcu_read_lock(&ctrl->srcu);
> +               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> +                               srcu_read_lock_held(&ctrl->srcu)) {
> +                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
> +                               blk_mq_quiesce_queue_nowait(ns->queue);
> +               }
> +               srcu_read_unlock(&ctrl->srcu, srcu_idx);
> +       }
> +       blk_mq_wait_quiesce_done(ctrl->tagset);
>  }
>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
> 
> Here we iterate through ctrl->namespaces instead of relying on tag_list
> and so we don't need to acquire ->tag_list_lock.

How can you make sure all NSs are covered in this way? RCU/SRCU can't
provide that kind of guarantee.

> 
> > And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> > 
> >>
> >>>
> >>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> >>> fortunately that is what nvme is doing.
> >>>
> >>>
> >>>> If yes then it means that we should be able to grab ->elevator_lock
> >>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> >>>> order should be in each code path,
> >>>>
> >>>> __blk_mq_update_nr_hw_queues
> >>>>     ->elevator_lock 
> >>>>       ->freeze_lock
> >>>
> >>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>> just make things worse. Why can't we disable elevator switch during
> >>> updating nr_hw_queues?
> >>>
> >> I couldn't quite understand this. As we already first disable the elevator
> >> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >> Once mapping is successful we switch back the elevator.
> > 
> > Yes, but user still may switch elevator from none to others during the
> > period, right?
> > 
> Yes correct, that's possible. So your suggestion was to disable elevator
> update while we're running __blk_mq_update_nr_hw_queues? And that way user
> couldn't update elevator through sysfs (elv_iosched_store) while we update
> nr_hw_queues? If this is true then still how could it help solve lockdep
> splat? 

Then why do you think a per-set lock can solve the lockdep splat?

__blk_mq_update_nr_hw_queues() is the only place where elevator switching
involves all queues in the tag set. If elevator switching is not allowed
once __blk_mq_update_nr_hw_queues() has started, why do we need a per-set
lock?


Thanks,
Ming
Nilay Shroff April 9, 2025, 9:12 a.m. UTC | #14
On 4/8/25 7:20 PM, Ming Lei wrote:
> On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/8/25 1:08 PM, Ming Lei wrote:
>>> On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/7/25 8:39 AM, Ming Lei wrote:
>>>>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
>>>>>>
>>>>>>
>>>>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
>>>>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
>>>>>>>> Fixes the following lockdep warning:
>>>>>>>
>>>>>>> Please spell the actual dependency out here, links are not permanent
>>>>>>> and also not readable for any offline reading of the commit logs.
>>>>>>>
>>>>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>>>>>>>> +				   struct request_queue *q, bool lock)
>>>>>>>> +{
>>>>>>>> +	if (lock) {
>>>>>>>
>>>>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
>>>>>>> screaming at you (in this case even for the right reason :))
>>>>>>>
>>>>>>>> +		/* protect against switching io scheduler  */
>>>>>>>> +		mutex_lock(&q->elevator_lock);
>>>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>>>>>> +		mutex_unlock(&q->elevator_lock);
>>>>>>>> +	} else {
>>>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
>>>>>>>> +	}
>>>>>>>
>>>>>>> I think the problem here is again that because of all the other
>>>>>>> dependencies elevator_lock really needs to be per-set instead of
>>>>>>> per-queue which will allows us to have much saner locking hierarchies.
>>>>>>>
>>>>>> I believe you meant here q->tag_set->elevator_lock? 
>>>>>
>>>>> I don't know what locks you are planning to invent.
>>>>>
>>>>> For set->tag_list_lock, it has been very fragile:
>>>>>
>>>>> blk_mq_update_nr_hw_queues
>>>>> 	set->tag_list_lock
>>>>> 		freeze_queue
>>>>>
>>>>> If IO failure happens when waiting in above freeze_queue(), the nvme error
>>>>> handling can't provide forward progress any more, because the error
>>>>> handling code path requires set->tag_list_lock.
>>>>
>>>> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues
>>>
>>> Yes.
>>>
>>>> which is called in nvme error handling path. If yes then I believe this function 
>>>> could be easily modified so that it doesn't require ->tag_list_lock. 
>>>
>>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
>>>
>> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 777db89fdaa7..002d2fd20e0c 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
>>  {
>>         if (!ctrl->tagset)
>>                 return;
>> -       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>> -               blk_mq_quiesce_tagset(ctrl->tagset);
>> -       else
>> -               blk_mq_wait_quiesce_done(ctrl->tagset);
>> +       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
>> +               struct nvme_ns *ns;
>> +               int srcu_idx;
>> +
>> +               srcu_idx = srcu_read_lock(&ctrl->srcu);
>> +               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>> +                               srcu_read_lock_held(&ctrl->srcu)) {
>> +                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
>> +                               blk_mq_quiesce_queue_nowait(ns->queue);
>> +               }
>> +               srcu_read_unlock(&ctrl->srcu, srcu_idx);
>> +       }
>> +       blk_mq_wait_quiesce_done(ctrl->tagset);
>>  }
>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
>>
>> Here we iterate through ctrl->namespaces instead of relying on tag_list
>> and so we don't need to acquire ->tag_list_lock.
> 
> How can you make sure all NSs are covered in this way? RCU/SRCU can't
> provide such kind of guarantee.
> 
Why is that so? In fact, nvme_wait_freeze also iterates through 
the same ctrl->namespaces to freeze the queues.

>>
>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
>>>
>>>>
>>>>>
>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
>>>>> fortunately that is what nvme is doing.
>>>>>
>>>>>
>>>>>> If yes then it means that we should be able to grab ->elevator_lock
>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
>>>>>> order should be in each code path,
>>>>>>
>>>>>> __blk_mq_update_nr_hw_queues
>>>>>>     ->elevator_lock 
>>>>>>       ->freeze_lock
>>>>>
>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
>>>>> just make things worse. Why can't we disable elevator switch during
>>>>> updating nr_hw_queues?
>>>>>
>>>> I couldn't quite understand this. As we already first disable the elevator
>>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
>>>> Once mapping is successful we switch back the elevator.
>>>
>>> Yes, but user still may switch elevator from none to others during the
>>> period, right?
>>>
>> Yes correct, that's possible. So your suggestion was to disable elevator
>> update while we're running __blk_mq_update_nr_hw_queues? And that way user
>> couldn't update elevator through sysfs (elv_iosched_store) while we update
>> nr_hw_queues? If this is true then still how could it help solve lockdep
>> splat? 
> 
> Then why do you think per-set lock can solve the lockdep splat?
> 
> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
> involved wrt. switching elevator. If elevator switching is not allowed
> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
> lock?
> 
Yes, if elevator switching is not allowed then we probably don't need a per-set
lock. However my question was: if we were to not allow elevator switching while 
__blk_mq_update_nr_hw_queues is running, then how would we implement it?
Do we need to synchronize with ->tag_list_lock? Or in other words, would
elv_iosched_store now depend on ->tag_list_lock?

On another note, if we choose to make ->elevator_lock per-set then 
our locking sequence in blk_mq_update_nr_hw_queues() would be:

blk_mq_update_nr_hw_queues
  -> tag_list_lock
    -> elevator_lock
     -> freeze_lock 

elv_iosched_store
  -> elevator_lock
    -> freeze_lock

So now ->freeze_lock would not depend on ->elevator_lock, and that should
help avoid a few of the recent lockdep splats reported with fs_reclaim.
What do you think?

Thanks,
--Nilay
Ming Lei April 9, 2025, 11:46 a.m. UTC | #15
On Wed, Apr 09, 2025 at 02:42:06PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/8/25 7:20 PM, Ming Lei wrote:
> > On Tue, Apr 08, 2025 at 06:55:26PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/8/25 1:08 PM, Ming Lei wrote:
> >>> On Mon, Apr 07, 2025 at 01:59:48PM +0530, Nilay Shroff wrote:
> >>>>
> >>>>
> >>>> On 4/7/25 8:39 AM, Ming Lei wrote:
> >>>>> On Sat, Apr 05, 2025 at 07:44:19PM +0530, Nilay Shroff wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 4/4/25 2:40 PM, Christoph Hellwig wrote:
> >>>>>>> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> >>>>>>>> Fixes the following lockdep warning:
> >>>>>>>
> >>>>>>> Please spell the actual dependency out here, links are not permanent
> >>>>>>> and also not readable for any offline reading of the commit logs.
> >>>>>>>
> >>>>>>>> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >>>>>>>> +				   struct request_queue *q, bool lock)
> >>>>>>>> +{
> >>>>>>>> +	if (lock) {
> >>>>>>>
> >>>>>>> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> >>>>>>> screaming at you (in this case even for the right reason :))
> >>>>>>>
> >>>>>>>> +		/* protect against switching io scheduler  */
> >>>>>>>> +		mutex_lock(&q->elevator_lock);
> >>>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>>>>>> +		mutex_unlock(&q->elevator_lock);
> >>>>>>>> +	} else {
> >>>>>>>> +		__blk_mq_realloc_hw_ctxs(set, q);
> >>>>>>>> +	}
> >>>>>>>
> >>>>>>> I think the problem here is again that because of all the other
> >>>>>>> dependencies elevator_lock really needs to be per-set instead of
> >>>>>>> per-queue which will allows us to have much saner locking hierarchies.
> >>>>>>>
> >>>>>> I believe you meant here q->tag_set->elevator_lock? 
> >>>>>
> >>>>> I don't know what locks you are planning to invent.
> >>>>>
> >>>>> For set->tag_list_lock, it has been very fragile:
> >>>>>
> >>>>> blk_mq_update_nr_hw_queues
> >>>>> 	set->tag_list_lock
> >>>>> 		freeze_queue
> >>>>>
> >>>>> If IO failure happens when waiting in above freeze_queue(), the nvme error
> >>>>> handling can't provide forward progress any more, because the error
> >>>>> handling code path requires set->tag_list_lock.
> >>>>
> >>>> I think you're referring here nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> >>>
> >>> Yes.
> >>>
> >>>> which is called in nvme error handling path. If yes then I believe this function 
> >>>> could be easily modified so that it doesn't require ->tag_list_lock. 
> >>>
> >>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
> >>>
> >> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
> >>
> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> index 777db89fdaa7..002d2fd20e0c 100644
> >> --- a/drivers/nvme/host/core.c
> >> +++ b/drivers/nvme/host/core.c
> >> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
> >>  {
> >>         if (!ctrl->tagset)
> >>                 return;
> >> -       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> >> -               blk_mq_quiesce_tagset(ctrl->tagset);
> >> -       else
> >> -               blk_mq_wait_quiesce_done(ctrl->tagset);
> >> +       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> >> +               struct nvme_ns *ns;
> >> +               int srcu_idx;
> >> +
> >> +               srcu_idx = srcu_read_lock(&ctrl->srcu);
> >> +               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> >> +                               srcu_read_lock_held(&ctrl->srcu)) {
> >> +                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
> >> +                               blk_mq_quiesce_queue_nowait(ns->queue);
> >> +               }
> >> +               srcu_read_unlock(&ctrl->srcu, srcu_idx);
> >> +       }
> >> +       blk_mq_wait_quiesce_done(ctrl->tagset);
> >>  }
> >>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
> >>
> >> Here we iterate through ctrl->namespaces instead of relying on tag_list
> >> and so we don't need to acquire ->tag_list_lock.
> > 
> > How can you make sure all NSs are covered in this way? RCU/SRCU can't
> > provide such kind of guarantee.
> > 
> Why is that so? In fact, nvme_wait_freeze also iterates through 
> the same ctrl->namespaces to freeze the queue.

It depends on whether nvme error handling needs to cover newly arriving NSs;
suppose it doesn't care, then you can change to srcu and bypass
->tag_list_lock.

> 
> >>
> >>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> >>>
> >>>>
> >>>>>
> >>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> >>>>> fortunately that is what nvme is doing.
> >>>>>
> >>>>>
> >>>>>> If yes then it means that we should be able to grab ->elevator_lock
> >>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> >>>>>> order should be in each code path,
> >>>>>>
> >>>>>> __blk_mq_update_nr_hw_queues
> >>>>>>     ->elevator_lock 
> >>>>>>       ->freeze_lock
> >>>>>
> >>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>>>> just make things worse. Why can't we disable elevator switch during
> >>>>> updating nr_hw_queues?
> >>>>>
> >>>> I couldn't quite understand this. As we already first disable the elevator
> >>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >>>> Once mapping is successful we switch back the elevator.
> >>>
> >>> Yes, but user still may switch elevator from none to others during the
> >>> period, right?
> >>>
> >> Yes correct, that's possible. So your suggestion was to disable elevator
> >> update while we're running __blk_mq_update_nr_hw_queues? And that way user
> >> couldn't update elevator through sysfs (elv_iosched_store) while we update
> >> nr_hw_queues? If this is true then still how could it help solve lockdep
> >> splat? 
> > 
> > Then why do you think per-set lock can solve the lockdep splat?
> > 
> > __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
> > involved wrt. switching elevator. If elevator switching is not allowed
> > when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
> > lock?
> > 
> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
> However my question was if we were to not allow elevator switch while 
> __blk_mq_update_nr_hw_queues is running then how would we implement it?

It can be done easily with tag_set->srcu, roughly as sketched below.
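
A minimal sketch of the idea (flag, field and helper names are illustrative
only, not an existing API):

	/* updater side, e.g. at the start of blk_mq_update_nr_hw_queues() */
	set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
	synchronize_srcu(set->srcu);	/* wait for in-flight elevator switches */
	/* ... update nr_hw_queues ... */
	set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;

	/* reader side, e.g. elv_iosched_store() */
	idx = srcu_read_lock(set->srcu);
	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES)
		ret = -EBUSY;	/* refuse to switch while the update runs */
	else
		ret = switch_to_requested_elevator(q);	/* illustrative helper */
	srcu_read_unlock(set->srcu, idx);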

> Do we need to synchronize with ->tag_list_lock? Or in another words,
> elv_iosched_store would now depends on ->tag_list_lock ? 

->tag_list_lock isn't involved.

> 
> On another note, if we choose to make ->elevator_lock per-set then 
> our locking sequence in blk_mq_update_nr_hw_queues() would be,

There is also add/del disk vs. updating nr_hw_queues; do you want to
take the per-set lock in the add/del disk path too?

> 
> blk_mq_update_nr_hw_queues
>   -> tag_list_lock
>     -> elevator_lock
>      -> freeze_lock 

Actually the freeze lock is already held for nvme before calling
blk_mq_update_nr_hw_queues, and it is reasonable to assume the queues are
frozen for updating nr_hw_queues, so the above order may not match the
existing code.

Do we need to consider nvme or blk_mq_update_nr_hw_queues now?

> 
> elv_iosched_store
>   -> elevator_lock
>     -> freeze_lock

I understand that the per-set elevator_lock is just for avoiding acquiring the
nested elevator lock class? If we don't need to consider nvme or
blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.

It is actually easy to sync elevator store vs. updating nr_hw_queues.

> 
> So now ->freeze_lock should not depend on ->elevator_lock and that shall
> help avoid few of the recent lockdep splats reported with fs_reclaim.
> What do you think?

Yes, reordering ->freeze_lock and ->elevator_lock may avoid many
fs_reclaim-related splats.

However, in del_gendisk(), freeze_lock is still held before calling
elevator_exit() and blk_unregister_queue(), and that looks not easy to reorder.

Thanks,
Ming
Nilay Shroff April 9, 2025, 1:46 p.m. UTC | #16
On 4/9/25 5:16 PM, Ming Lei wrote:
>>>>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
>>>>>
>>>> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 777db89fdaa7..002d2fd20e0c 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
>>>>  {
>>>>         if (!ctrl->tagset)
>>>>                 return;
>>>> -       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>>> -               blk_mq_quiesce_tagset(ctrl->tagset);
>>>> -       else
>>>> -               blk_mq_wait_quiesce_done(ctrl->tagset);
>>>> +       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
>>>> +               struct nvme_ns *ns;
>>>> +               int srcu_idx;
>>>> +
>>>> +               srcu_idx = srcu_read_lock(&ctrl->srcu);
>>>> +               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>>> +                               srcu_read_lock_held(&ctrl->srcu)) {
>>>> +                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
>>>> +                               blk_mq_quiesce_queue_nowait(ns->queue);
>>>> +               }
>>>> +               srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>> +       }
>>>> +       blk_mq_wait_quiesce_done(ctrl->tagset);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
>>>>
>>>> Here we iterate through ctrl->namespaces instead of relying on tag_list
>>>> and so we don't need to acquire ->tag_list_lock.
>>>
>>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
>>> provide such kind of guarantee.
>>>
>> Why is that so? In fact, nvme_wait_freeze also iterates through 
>> the same ctrl->namespaces to freeze the queue.
> 
> It depends if nvme error handling needs to cover new coming NS,
> suppose it doesn't care, and you can change to srcu and bypass
> ->tag_list_lock.
> 
Yes, a newly added NS may not be live yet when we iterate through 
ctrl->namespaces, so we don't need to bother about it yet.
>>
>>>>
>>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
>>>>>>> fortunately that is what nvme is doing.
>>>>>>>
>>>>>>>
>>>>>>>> If yes then it means that we should be able to grab ->elevator_lock
>>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
>>>>>>>> order should be in each code path,
>>>>>>>>
>>>>>>>> __blk_mq_update_nr_hw_queues
>>>>>>>>     ->elevator_lock 
>>>>>>>>       ->freeze_lock
>>>>>>>
>>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
>>>>>>> just make things worse. Why can't we disable elevator switch during
>>>>>>> updating nr_hw_queues?
>>>>>>>
>>>>>> I couldn't quite understand this. As we already first disable the elevator
>>>>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
>>>>>> Once mapping is successful we switch back the elevator.
>>>>>
>>>>> Yes, but user still may switch elevator from none to others during the
>>>>> period, right?
>>>>>
>>>> Yes correct, that's possible. So your suggestion was to disable elevator
>>>> update while we're running __blk_mq_update_nr_hw_queues? And that way user
>>>> couldn't update elevator through sysfs (elv_iosched_store) while we update
>>>> nr_hw_queues? If this is true then still how could it help solve lockdep
>>>> splat? 
>>>
>>> Then why do you think per-set lock can solve the lockdep splat?
>>>
>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
>>> involved wrt. switching elevator. If elevator switching is not allowed
>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
>>> lock?
>>>
>> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
>> However my question was if we were to not allow elevator switch while 
>> __blk_mq_update_nr_hw_queues is running then how would we implement it?
> 
> It can be done easily by tag_set->srcu.
OK, great if that's possible! But I'm not sure how it could be done in this
case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
run in the writer/updater context, so you may still need a lock? Can you
please send across an (informal) patch with your idea?

> 
>> Do we need to synchronize with ->tag_list_lock? Or in another words,
>> elv_iosched_store would now depends on ->tag_list_lock ? 
> 
> ->tag_list_lock isn't involved.
> 
>>
>> On another note, if we choose to make ->elevator_lock per-set then 
>> our locking sequence in blk_mq_update_nr_hw_queues() would be,
> 
> There is also add/del disk vs. updating nr_hw_queues, do you want to
> add the per-set lock in add/del disk path too?

Ideally no we don't need to acquire ->elevator_lock in this path.
Please see below.

>>
>> blk_mq_update_nr_hw_queues
>>   -> tag_list_lock
>>     -> elevator_lock
>>      -> freeze_lock 
> 
> Actually freeze lock is already held for nvme before calling
> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
> frozen for updating nr_hw_queues, so the above order may not match
> with the existed code.
> 
> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> 
I think we should consider (maybe in a different patch) updating
nvme_quiesce_io_queues and nvme_unquiesce_io_queues to remove
their dependency on ->tag_list_lock.

>>
>> elv_iosched_store
>>   -> elevator_lock
>>     -> freeze_lock
> 
> I understand that the per-set elevator_lock is just for avoiding the
> nested elvevator lock class acquire? If we needn't to consider nvme
> or blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
> 
> It is actually easy to sync elevator store vs. update nr_hw_queues.
> 
>>
>> So now ->freeze_lock should not depend on ->elevator_lock and that shall
>> help avoid few of the recent lockdep splats reported with fs_reclaim.
>> What do you think?
> 
> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
> related splat.
> 
> However, in del_gendisk(), freeze_lock is still held before calling
> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.

Yes agreed, however elevator_exit() called from del_gendisk(), or 
elv_unregister_queue() called from blk_unregister_queue(), runs 
after we unregister the queue. And if the queue has already been unregistered
when invoking elevator_exit() or del_gendisk(), then ideally we don't need to
acquire ->elevator_lock. The same is true for elevator_exit() called 
from add_disk_fwnode(). So IMO, we should update these paths to avoid 
acquiring ->elevator_lock.

Thanks,
--Nilay
Ming Lei April 9, 2025, 2:08 p.m. UTC | #17
On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/9/25 5:16 PM, Ming Lei wrote:
> >>>>> Not sure it is easily, ->tag_list_lock is exactly for protecting the list of "set->tag_list".
> >>>>>
> >>>> Please see this, here nvme_quiesce_io_queues doen't require ->tag_list_lock:
> >>>>
> >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>>> index 777db89fdaa7..002d2fd20e0c 100644
> >>>> --- a/drivers/nvme/host/core.c
> >>>> +++ b/drivers/nvme/host/core.c
> >>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
> >>>>  {
> >>>>         if (!ctrl->tagset)
> >>>>                 return;
> >>>> -       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> >>>> -               blk_mq_quiesce_tagset(ctrl->tagset);
> >>>> -       else
> >>>> -               blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>> +       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> >>>> +               struct nvme_ns *ns;
> >>>> +               int srcu_idx;
> >>>> +
> >>>> +               srcu_idx = srcu_read_lock(&ctrl->srcu);
> >>>> +               list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> >>>> +                               srcu_read_lock_held(&ctrl->srcu)) {
> >>>> +                       if (!blk_queue_skip_tagset_quiesce(ns->queue))
> >>>> +                               blk_mq_quiesce_queue_nowait(ns->queue);
> >>>> +               }
> >>>> +               srcu_read_unlock(&ctrl->srcu, srcu_idx);
> >>>> +       }
> >>>> +       blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
> >>>>
> >>>> Here we iterate through ctrl->namespaces instead of relying on tag_list
> >>>> and so we don't need to acquire ->tag_list_lock.
> >>>
> >>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
> >>> provide such kind of guarantee.
> >>>
> >> Why is that so? In fact, nvme_wait_freeze also iterates through 
> >> the same ctrl->namespaces to freeze the queue.
> > 
> > It depends if nvme error handling needs to cover new coming NS,
> > suppose it doesn't care, and you can change to srcu and bypass
> > ->tag_list_lock.
> > 
> Yes new incoming NS may not be live yet when we iterate through 
> ctrl->namespaces. So we don't need bother about it yet.
> >>
> >>>>
> >>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues,
> >>>>>>> fortunately that is what nvme is doing.
> >>>>>>>
> >>>>>>>
> >>>>>>>> If yes then it means that we should be able to grab ->elevator_lock
> >>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues and so locking
> >>>>>>>> order should be in each code path,
> >>>>>>>>
> >>>>>>>> __blk_mq_update_nr_hw_queues
> >>>>>>>>     ->elevator_lock 
> >>>>>>>>       ->freeze_lock
> >>>>>>>
> >>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>>>>>> just make things worse. Why can't we disable elevator switch during
> >>>>>>> updating nr_hw_queues?
> >>>>>>>
> >>>>>> I couldn't quite understand this. As we already first disable the elevator
> >>>>>> before updating sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >>>>>> Once mapping is successful we switch back the elevator.
> >>>>>
> >>>>> Yes, but user still may switch elevator from none to others during the
> >>>>> period, right?
> >>>>>
> >>>> Yes correct, that's possible. So your suggestion was to disable elevator
> >>>> update while we're running __blk_mq_update_nr_hw_queues? And that way user
> >>>> couldn't update elevator through sysfs (elv_iosched_store) while we update
> >>>> nr_hw_queues? If this is true then still how could it help solve lockdep
> >>>> splat? 
> >>>
> >>> Then why do you think per-set lock can solve the lockdep splat?
> >>>
> >>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
> >>> involved wrt. switching elevator. If elevator switching is not allowed
> >>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
> >>> lock?
> >>>
> >> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
> >> However my question was if we were to not allow elevator switch while 
> >> __blk_mq_update_nr_hw_queues is running then how would we implement it?
> > 
> > It can be done easily by tag_set->srcu.
> Ok great if that's possible! But I'm not sure how it could be done in this
> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
> run in the writer/updater context. So you may still need lock? Can you
> please send across a (informal) patch with your idea ?

Please see the attached patch, which treats elv_iosched_store() as a reader.

> 
> > 
> >> Do we need to synchronize with ->tag_list_lock? Or in another words,
> >> elv_iosched_store would now depends on ->tag_list_lock ? 
> > 
> > ->tag_list_lock isn't involved.
> > 
> >>
> >> On another note, if we choose to make ->elevator_lock per-set then 
> >> our locking sequence in blk_mq_update_nr_hw_queues() would be,
> > 
> > There is also add/del disk vs. updating nr_hw_queues, do you want to
> > add the per-set lock in add/del disk path too?
> 
> Ideally no we don't need to acquire ->elevator_lock in this path.
> Please see below.

blk_mq_update_nr_hw_queues() can come at any time, and there is still a
race between blk_mq_update_nr_hw_queues and add/del disk.

> 
> >>
> >> blk_mq_update_nr_hw_queues
> >>   -> tag_list_lock
> >>     -> elevator_lock
> >>      -> freeze_lock 
> > 
> > Actually freeze lock is already held for nvme before calling
> > blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
> > frozen for updating nr_hw_queues, so the above order may not match
> > with the existed code.
> > 
> > Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> > 
> I think we should consider (may be in different patch) updating
> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
> its dependency on ->tag_list_lock.

If we need to take nvme into account, the above lock order doesn't work,
because nvme freezes the queue before calling blk_mq_update_nr_hw_queues(),
and the elevator lock still depends on the freeze lock.

If nvme needn't be considered, the per-set lock becomes necessary.

> 
> >>
> >> elv_iosched_store
> >>   -> elevator_lock
> >>     -> freeze_lock
> > 
> > I understand that the per-set elevator_lock is just for avoiding the
> > nested elvevator lock class acquire? If we needn't to consider nvme
> > or blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
> > 
> > It is actually easy to sync elevator store vs. update nr_hw_queues.
> > 
> >>
> >> So now ->freeze_lock should not depend on ->elevator_lock and that shall
> >> help avoid few of the recent lockdep splats reported with fs_reclaim.
> >> What do you think?
> > 
> > Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
> > related splat.
> > 
> > However, in del_gendisk(), freeze_lock is still held before calling
> > elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.
> 
> Yes agreed, however elevator_exit() called from del_gendisk() or 
> elv_unregister_queue() called from blk_unregister_queue() are called 
> after we unregister the queue. And if queue has been already unregistered
> while invoking elevator_exit or del_gensidk then ideally we don't need to
> acquire ->elevator_lock. The same is true for elevator_exit() called 
> from add_disk_fwnode(). So IMO, we should update these paths to avoid 
> acquiring ->elevator_lock.

As I mentioned, blk_mq_update_nr_hw_queues() can still come, and it is a
host-wide event, so either a lock or srcu sync is needed.


Thanks,
Ming
From a475139e47e745f32a68725f4abc59f9a0083d57 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 1 Apr 2025 12:42:29 +0000
Subject: [PATCH] block: prevent elevator switch during updating nr_hw_queues

Updating nr_hw_queues is usually done from error handling code, where it
doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
may change and the elevator tags depend on nr_hw_queues.

Prevent elevator switching while nr_hw_queues is being updated by setting
the BLK_MQ_F_UPDATE_HW_QUEUES flag, and use srcu to fail elevator switch
attempts during that period.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq.c         | 32 ++++++++++++++++++++------------
 block/elevator.c       | 12 +++++++++++-
 include/linux/blk-mq.h |  9 ++++++++-
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c308699ded58..27f984311bb7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -180,6 +180,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(TAG_RR),
 	HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT),
+	HCTX_FLAG_NAME(UPDATE_HW_QUEUES),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d7a103dc258b..c1e7e1823369 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4776,14 +4776,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
 		set->nr_hw_queues = nr_cpu_ids;
 
-	if (set->flags & BLK_MQ_F_BLOCKING) {
-		set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
-		if (!set->srcu)
-			return -ENOMEM;
-		ret = init_srcu_struct(set->srcu);
-		if (ret)
-			goto out_free_srcu;
-	}
+	set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
+	if (!set->srcu)
+		return -ENOMEM;
+	ret = init_srcu_struct(set->srcu);
+	if (ret)
+		goto out_free_srcu;
 
 	ret = -ENOMEM;
 	set->tags = kcalloc_node(set->nr_hw_queues,
@@ -4864,10 +4862,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 
 	kfree(set->tags);
 	set->tags = NULL;
-	if (set->flags & BLK_MQ_F_BLOCKING) {
-		cleanup_srcu_struct(set->srcu);
-		kfree(set->srcu);
-	}
+
+	cleanup_srcu_struct(set->srcu);
+	kfree(set->srcu);
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
@@ -5081,7 +5078,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	mutex_lock(&set->tag_list_lock);
+	/*
+	 * Mark that an nr_hw_queues update is in progress to prevent
+	 * elevator switching.
+	 *
+	 * The elevator switch code can _not_ acquire ->tag_list_lock.
+	 */
+	set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
+	synchronize_srcu(set->srcu);
+
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
+
+	set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;
 	mutex_unlock(&set->tag_list_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/block/elevator.c b/block/elevator.c
index cf48613c6e62..e50e04ed15a0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -718,9 +718,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 {
 	char elevator_name[ELV_NAME_MAX];
 	char *name;
-	int ret;
+	int ret, idx;
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	/*
 	 * If the attribute needs to load a module, do it before freezing the
@@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 
 	elv_iosched_load_module(name);
 
+	idx = srcu_read_lock(set->srcu);
+
+	if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
+		ret = -EBUSY;
+		goto exit;
+	}
+
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
@@ -739,6 +747,8 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 		ret = count;
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+exit:
+	srcu_read_unlock(set->srcu, idx);
 	return ret;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..71e05245af9d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -681,7 +681,14 @@ enum {
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 6,
 
-	BLK_MQ_F_MAX = 1 << 7,
+	/*
+	 * True while an nr_hw_queues update is in progress
+	 *
+	 * tag_set-only flag, not usable for hctx
+	 */
+	BLK_MQ_F_UPDATE_HW_QUEUES	= 1 << 7,
+
+	BLK_MQ_F_MAX = 1 << 8,
 };
 
 #define BLK_MQ_MAX_DEPTH	(10240)
Nilay Shroff April 9, 2025, 7:45 p.m. UTC | #18
On 4/9/25 7:38 PM, Ming Lei wrote:
>>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
>>>>> involved wrt. switching elevator. If elevator switching is not allowed
>>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
>>>>> lock?
>>>>>
>>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
>>>> However my question was if we were to not allow elevator switch while 
>>>> __blk_mq_update_nr_hw_queues is running then how would we implement it?
>>>
>>> It can be done easily by tag_set->srcu.
>> Ok great if that's possible! But I'm not sure how it could be done in this
>> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
>> run in the writer/updater context. So you may still need lock? Can you
>> please send across a (informal) patch with your idea ?
> 
> Please see the attached patch which treats elv_iosched_store() as reader.
> 
I looked through the patch and it looks good. However, we still have an
issue in code paths where we acquire ->elevator_lock without first
freezing the queue. In this case, if we try to allocate memory using
GFP_KERNEL, then fs_reclaim would still trigger a lockdep splat. For this
case the lock order would be,

elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)

In fact, I tested your patch and ran into the above splat. I've attached
the lockdep splat for your reference. So IMO, we should instead arrange
the locking order such that ->freeze_lock never depends on
->elevator_lock. One way to make that possible is to make
->elevator_lock per-set.
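
Just to illustrate the pattern I mean (not taken from any particular
path; the kzalloc() below is only a placeholder):

	/* any path taking ->elevator_lock without freezing the queue */
	mutex_lock(&q->elevator_lock);
	obj = kzalloc(sizeof(*obj), GFP_KERNEL);  /* may enter fs_reclaim */
	mutex_unlock(&q->elevator_lock);

	/* elsewhere, e.g. elv_iosched_store() */
	memflags = blk_mq_freeze_queue(q);        /* q_usage_counter(io) */
	mutex_lock(&q->elevator_lock);            /* closes the cycle */

The first fragment adds elevator_lock -> fs_reclaim, fs_reclaim already
waits on q_usage_counter(io), and the second fragment adds
q_usage_counter(io) -> elevator_lock.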

>>>
>>> Actually freeze lock is already held for nvme before calling
>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
>>> frozen for updating nr_hw_queues, so the above order may not match
>>> with the existed code.
>>>
>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>>>
>> I think we should consider (may be in different patch) updating
>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
>> its dependency on ->tag_list_lock.
> 
> If we need to take nvme into account, the above lock order doesn't work,
> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
> and elevator lock still depends on freeze lock.
> 
> If it needn't to be considered, per-set lock becomes necessary.
IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues
we should also update the nvme pci, rdma and tcp drivers so that those
drivers neither freeze the queue prior to invoking
blk_mq_update_nr_hw_queues nor unfreeze it after
blk_mq_update_nr_hw_queues returns. I see that nvme loop and fc don't
freeze the queue prior to invoking blk_mq_update_nr_hw_queues.

>>>>
>>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall
>>>> help avoid few of the recent lockdep splats reported with fs_reclaim.
>>>> What do you think?
>>>
>>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
>>> related splat.
>>>
>>> However, in del_gendisk(), freeze_lock is still held before calling
>>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.
>>
>> Yes agreed, however elevator_exit() called from del_gendisk() or 
>> elv_unregister_queue() called from blk_unregister_queue() are called 
>> after we unregister the queue. And if queue has been already unregistered
>> while invoking elevator_exit or del_gensidk then ideally we don't need to
>> acquire ->elevator_lock. The same is true for elevator_exit() called 
>> from add_disk_fwnode(). So IMO, we should update these paths to avoid 
>> acquiring ->elevator_lock.
> 
> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
> host wide event, so either lock or srcu sync is needed.
Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel 
with del_gendisk or blk_unregister_queue.

Thanks,
--Nilay
[  399.112145] ======================================================
[  399.112153] WARNING: possible circular locking dependency detected
[  399.112160] 6.15.0-rc1+ #159 Not tainted
[  399.112167] ------------------------------------------------------
[  399.112174] bash/5891 is trying to acquire lock:
[  399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4
[  399.112205]
[  399.112205] but task is already holding lock:
[  399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[  399.112231]
[  399.112231] which lock already depends on the new lock.
[  399.112231]
[  399.112239]
[  399.112239] the existing dependency chain (in reverse order) is:
[  399.112246]
[  399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}:
[  399.112260]        blk_alloc_queue+0x3a8/0x3e4
[  399.112268]        blk_mq_alloc_queue+0x88/0x11c
[  399.112278]        __blk_mq_alloc_disk+0x34/0xd8
[  399.112290]        null_add_dev+0x3c8/0x914 [null_blk]
[  399.112306]        null_init+0x1e0/0x4bc [null_blk]
[  399.112319]        do_one_initcall+0x8c/0x4b8
[  399.112340]        do_init_module+0x7c/0x2c4
[  399.112396]        init_module_from_file+0xb4/0x108
[  399.112405]        idempotent_init_module+0x26c/0x368
[  399.112414]        sys_finit_module+0x98/0x150
[  399.112422]        system_call_exception+0x134/0x360
[  399.112430]        system_call_vectored_common+0x15c/0x2ec
[  399.112441]
[  399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}:
[  399.112453]        fs_reclaim_acquire+0xe4/0x120
[  399.112461]        kmem_cache_alloc_noprof+0x74/0x570
[  399.112469]        __kernfs_new_node+0x98/0x37c
[  399.112479]        kernfs_new_node+0x94/0xe4
[  399.112490]        kernfs_create_dir_ns+0x44/0x120
[  399.112501]        sysfs_create_dir_ns+0x94/0x160
[  399.112512]        kobject_add_internal+0xf4/0x3c8
[  399.112524]        kobject_add+0x70/0x10c
[  399.112533]        elv_register_queue+0x70/0x14c
[  399.112562]        blk_register_queue+0x1d8/0x2bc
[  399.112575]        add_disk_fwnode+0x3b4/0x5d0
[  399.112588]        sd_probe+0x3bc/0x5b4 [sd_mod]
[  399.112601]        really_probe+0x104/0x4c4
[  399.112613]        __driver_probe_device+0xb8/0x200
[  399.112623]        driver_probe_device+0x54/0x128
[  399.112632]        __driver_attach_async_helper+0x7c/0x150
[  399.112641]        async_run_entry_fn+0x60/0x1bc
[  399.112665]        process_one_work+0x2ac/0x7e4
[  399.112678]        worker_thread+0x238/0x460
[  399.112691]        kthread+0x158/0x188
[  399.112700]        start_kernel_thread+0x14/0x18
[  399.112722]
[  399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}:
[  399.112737]        __lock_acquire+0x1bec/0x2bc8
[  399.112747]        lock_acquire+0x140/0x430
[  399.112760]        __mutex_lock+0xf0/0xb00
[  399.112769]        elv_iosched_store+0x11c/0x5d4
[  399.112781]        queue_attr_store+0x12c/0x164
[  399.112792]        sysfs_kf_write+0x74/0xc4
[  399.112804]        kernfs_fop_write_iter+0x1a8/0x2a4
[  399.112816]        vfs_write+0x410/0x584
[  399.112829]        ksys_write+0x84/0x140
[  399.112841]        system_call_exception+0x134/0x360
[  399.112851]        system_call_vectored_common+0x15c/0x2ec
[  399.112863]
[  399.112863] other info that might help us debug this:
[  399.112863]
[  399.112874] Chain exists of:
[  399.112874]   &q->elevator_lock --> fs_reclaim --> &q->q_usage_counter(io)#20
[  399.112874]
[  399.112896]  Possible unsafe locking scenario:
[  399.112896]
[  399.112905]        CPU0                    CPU1
[  399.112928]        ----                    ----
[  399.112936]   lock(&q->q_usage_counter(io)#20);
[  399.112954]                                lock(fs_reclaim);
[  399.112967]                                lock(&q->q_usage_counter(io)#20);
[  399.112987]   lock(&q->elevator_lock);
[  399.112998]
[  399.112998]  *** DEADLOCK ***
[  399.112998]
[  399.113010] 5 locks held by bash/5891:
[  399.113017]  #0: c000000009f87400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
[  399.113042]  #1: c0000000097cfa88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x164/0x2a4
[  399.113064]  #2: c0000000ba4634f8 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x170/0x2a4
[  399.113083]  #3: c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[  399.113102]  #4: c0000000bc233010 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[  399.113120]
[  399.113120] stack backtrace:
[  399.113126] CPU: 24 UID: 0 PID: 5891 Comm: bash Kdump: loaded Not tainted 6.15.0-rc1+ #159 VOLUNTARY
[  399.113130] Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
[  399.113132] Call Trace:
[  399.113133] [c000000082bd7580] [c0000000011b69f8] dump_stack_lvl+0x108/0x18c (unreliable)
[  399.113141] [c000000082bd75b0] [c00000000022512c] print_circular_bug+0x448/0x604
[  399.113146] [c000000082bd7660] [c000000000225534] check_noncircular+0x24c/0x26c
[  399.113150] [c000000082bd7730] [c00000000022b998] __lock_acquire+0x1bec/0x2bc8
[  399.113155] [c000000082bd7860] [c000000000228d20] lock_acquire+0x140/0x430
[  399.113159] [c000000082bd7960] [c0000000011f76e8] __mutex_lock+0xf0/0xb00
[  399.113163] [c000000082bd7a90] [c00000000090a558] elv_iosched_store+0x11c/0x5d4
[  399.113169] [c000000082bd7b50] [c000000000912d20] queue_attr_store+0x12c/0x164
[  399.113174] [c000000082bd7c60] [c0000000007d9114] sysfs_kf_write+0x74/0xc4
[  399.113179] [c000000082bd7ca0] [c0000000007d5b60] kernfs_fop_write_iter+0x1a8/0x2a4
[  399.113185] [c000000082bd7cf0] [c0000000006b386c] vfs_write+0x410/0x584
[  399.113190] [c000000082bd7dc0] [c0000000006b3d18] ksys_write+0x84/0x140
[  399.113196] [c000000082bd7e10] [c000000000031814] system_call_exception+0x134/0x360
[  399.113199] [c000000082bd7e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
[  399.113205] --- interrupt: 3000 at 0x7fffb113b034
[  399.113208] NIP:  00007fffb113b034 LR: 00007fffb113b034 CTR: 0000000000000000
[  399.113211] REGS: c000000082bd7e80 TRAP: 3000   Not tainted  (6.15.0-rc1+)
[  399.113213] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44422408  XER: 00000000
[  399.113223] IRQMASK: 0
[  399.113223] GPR00: 0000000000000004 00007fffe78cf020 00000001114d7e00 0000000000000001
[  399.113223] GPR04: 000000014ce03ae0 000000000000000c 0000000000000010 0000000000000001
[  399.113223] GPR08: 000000000000000b 0000000000000000 0000000000000000 0000000000000000
[  399.113223] GPR12: 0000000000000000 00007fffb139ab60 000000014cef4ed0 00000001114d87b8
[  399.113223] GPR16: 00000001114d94d8 0000000020000000 0000000000000000 00000001113e9070
[  399.113223] GPR20: 000000011147beb8 00007fffe78cf1c4 000000011147f8a0 00000001114d89bc
[  399.113223] GPR24: 00000001114d8a50 0000000000000000 000000014ce03ae0 000000000000000c
[  399.113223] GPR28: 000000000000000c 00007fffb12418e0 000000014ce03ae0 000000000000000c
[  399.113256] NIP [00007fffb113b034] 0x7fffb113b034
[  399.113258] LR [00007fffb113b034] 0x7fffb113b034
[  399.113260] --- interrupt: 3000
Ming Lei April 10, 2025, 2:10 a.m. UTC | #19
On Thu, Apr 10, 2025 at 01:15:22AM +0530, Nilay Shroff wrote:
> 
> 
> On 4/9/25 7:38 PM, Ming Lei wrote:
> >>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
> >>>>> involved wrt. switching elevator. If elevator switching is not allowed
> >>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
> >>>>> lock?
> >>>>>
> >>>> Yes if elevator switch is not allowed then we probably don't need per-set lock. 
> >>>> However my question was if we were to not allow elevator switch while 
> >>>> __blk_mq_update_nr_hw_queues is running then how would we implement it?
> >>>
> >>> It can be done easily by tag_set->srcu.
> >> Ok great if that's possible! But I'm not sure how it could be done in this
> >> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
> >> run in the writer/updater context. So you may still need lock? Can you
> >> please send across a (informal) patch with your idea ?

> On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote:
>> Yes new incoming NS may not be live yet when we iterate through
>> ctrl->namespaces. So we don't need bother about it yet.

If new NS needn't be considered, the nr_hw_queues update is easier to
deal with, for example:

- disable scheduler in 1st iterating tag_list

- update nr_hw_queues in 2nd iterating tag_list

- restore scheduler in 3rd iterating tag_list

->tag_list_lock is acquired & released in each iteration.

Then per-set lock isn't needed any more.
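
Roughly like this (the elv disable/restore helpers below are made-up
names, only to show the idea):

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_elv_disable(q);	/* switch to none, record old sched */
	mutex_unlock(&set->tag_list_lock);

	mutex_lock(&set->tag_list_lock);
	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
	mutex_unlock(&set->tag_list_lock);

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_elv_restore(q);	/* re-attach recorded scheduler */
	mutex_unlock(&set->tag_list_lock);

Each pass takes and drops ->tag_list_lock on its own, so nothing else has
to nest inside it.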

> > 
> > Please see the attached patch which treats elv_iosched_store() as reader.
> > 
> I looked trough patch and looks good. However we still have an issue 
> in code paths where we acquire ->elevator_lock without first freezing 
> the queue.In this case if we try allocate memory using GFP_KERNEL 
> then fs_reclaim would still trigger lockdep splat. As for this case
> the lock order would be,
> 
> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)

That is why I say ->elevator_lock is used too much; in particular it is
relied on for dealing with kobject & sysfs, which doesn't need to be done
this way.

> 
> In fact, I tested your patch and got into the above splat. I've 
> attached lockdep splat for your reference. So IMO, we should instead
> try make locking order such that ->freeze_lock shall never depend on
> ->elevator_lock. It seems one way to make that possible if we make
> ->elevator_lock per-set. 

The attached patch is just for avoiding the race between add/del disk and
updating nr_hw_queues, and it should really have been covering the race
between adding/exiting a request queue and updating nr_hw_queues.

If the re-ordering can be done, that is definitely good, but...

> 
> >>>
> >>> Actually freeze lock is already held for nvme before calling
> >>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
> >>> frozen for updating nr_hw_queues, so the above order may not match
> >>> with the existed code.
> >>>
> >>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> >>>
> >> I think we should consider (may be in different patch) updating
> >> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
> >> its dependency on ->tag_list_lock.
> > 
> > If we need to take nvme into account, the above lock order doesn't work,
> > because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
> > and elevator lock still depends on freeze lock.
> > 
> > If it needn't to be considered, per-set lock becomes necessary.
> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> we shall also update nvme pci, rdma and tcp drivers so that those 
> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues
> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that
> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues.

This way you cause a new deadlock or new trouble if you rely on the
freeze in blk_mq_update_nr_hw_queues():

 ->tag_list_lock
 	freeze_lock

If a timeout or io failure happens while waiting on the above
freeze_lock, the timeout handler can not move on because a new
blk_mq_update_nr_hw_queues() can't grab the lock.

Either it deadlocks or the device has to be removed.
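
Roughly the scenario I mean:

	updater					timeout/error handler
	-------					---------------------
	mutex_lock(&set->tag_list_lock);
	blk_mq_freeze_queue(q);
	  /* waits for in-flight IO */
						wants to recover the device,
						which ends up needing
						->tag_list_lock (e.g. another
						blk_mq_update_nr_hw_queues()),
						so it can't make progress

The freeze can only finish after the error handler completes the failed
IO, and the error handler can only run after ->tag_list_lock is released.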

> 
> >>>>
> >>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall
> >>>> help avoid few of the recent lockdep splats reported with fs_reclaim.
> >>>> What do you think?
> >>>
> >>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
> >>> related splat.
> >>>
> >>> However, in del_gendisk(), freeze_lock is still held before calling
> >>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.
> >>
> >> Yes agreed, however elevator_exit() called from del_gendisk() or 
> >> elv_unregister_queue() called from blk_unregister_queue() are called 
> >> after we unregister the queue. And if queue has been already unregistered
> >> while invoking elevator_exit or del_gensidk then ideally we don't need to
> >> acquire ->elevator_lock. The same is true for elevator_exit() called 
> >> from add_disk_fwnode(). So IMO, we should update these paths to avoid 
> >> acquiring ->elevator_lock.
> > 
> > As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
> > host wide event, so either lock or srcu sync is needed.
> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel 
> with del_gendisk or blk_unregister_queue.

Then the per-set lock is required in both add/del disk, so how do you
re-order freeze_lock & elevator lock in del_gendisk()?

And there is the same risk as the one in blk_mq_update_nr_hw_queues().

> 
> Thanks,
> --Nilay

> [  399.112145] ======================================================
> [  399.112153] WARNING: possible circular locking dependency detected
> [  399.112160] 6.15.0-rc1+ #159 Not tainted
> [  399.112167] ------------------------------------------------------
> [  399.112174] bash/5891 is trying to acquire lock:
> [  399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4
> [  399.112205]
> [  399.112205] but task is already holding lock:
> [  399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
> [  399.112231]
> [  399.112231] which lock already depends on the new lock.
> [  399.112231]
> [  399.112239]
> [  399.112239] the existing dependency chain (in reverse order) is:
> [  399.112246]
> [  399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}:
> [  399.112260]        blk_alloc_queue+0x3a8/0x3e4
> [  399.112268]        blk_mq_alloc_queue+0x88/0x11c
> [  399.112278]        __blk_mq_alloc_disk+0x34/0xd8
> [  399.112290]        null_add_dev+0x3c8/0x914 [null_blk]
> [  399.112306]        null_init+0x1e0/0x4bc [null_blk]
> [  399.112319]        do_one_initcall+0x8c/0x4b8
> [  399.112340]        do_init_module+0x7c/0x2c4
> [  399.112396]        init_module_from_file+0xb4/0x108
> [  399.112405]        idempotent_init_module+0x26c/0x368
> [  399.112414]        sys_finit_module+0x98/0x150
> [  399.112422]        system_call_exception+0x134/0x360
> [  399.112430]        system_call_vectored_common+0x15c/0x2ec
> [  399.112441]
> [  399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}:
> [  399.112453]        fs_reclaim_acquire+0xe4/0x120
> [  399.112461]        kmem_cache_alloc_noprof+0x74/0x570
> [  399.112469]        __kernfs_new_node+0x98/0x37c
> [  399.112479]        kernfs_new_node+0x94/0xe4
> [  399.112490]        kernfs_create_dir_ns+0x44/0x120
> [  399.112501]        sysfs_create_dir_ns+0x94/0x160
> [  399.112512]        kobject_add_internal+0xf4/0x3c8
> [  399.112524]        kobject_add+0x70/0x10c
> [  399.112533]        elv_register_queue+0x70/0x14c
> [  399.112562]        blk_register_queue+0x1d8/0x2bc
> [  399.112575]        add_disk_fwnode+0x3b4/0x5d0
> [  399.112588]        sd_probe+0x3bc/0x5b4 [sd_mod]
> [  399.112601]        really_probe+0x104/0x4c4
> [  399.112613]        __driver_probe_device+0xb8/0x200
> [  399.112623]        driver_probe_device+0x54/0x128
> [  399.112632]        __driver_attach_async_helper+0x7c/0x150
> [  399.112641]        async_run_entry_fn+0x60/0x1bc
> [  399.112665]        process_one_work+0x2ac/0x7e4
> [  399.112678]        worker_thread+0x238/0x460
> [  399.112691]        kthread+0x158/0x188
> [  399.112700]        start_kernel_thread+0x14/0x18
> [  399.112722]
> [  399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}:

Another way is to make sure that ->elevator_lock isn't required for
dealing with the kobject/debugfs things, which needs the elevator code to
be refactored.

For example, ->elevator_lock is grabbed in the debugfs handler, and
removing the sched debugfs entries actually needs to drain readers, which
is a deadlock too.
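
Something like (illustrative only, the show handler name is made up):

	/* debugfs read side */
	some_sched_debugfs_show()
		mutex_lock(&q->elevator_lock);
		...

	/* elevator teardown, with ->elevator_lock already held */
	mutex_lock(&q->elevator_lock);
	debugfs_remove_recursive(sched_dir);	/* waits for active readers */

debugfs_remove_recursive() has to wait until every opener of those files
is gone, and one of them may itself be blocked on ->elevator_lock.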

Thanks,
Ming
Nilay Shroff April 10, 2025, 1:36 p.m. UTC | #20
On 4/10/25 7:40 AM, Ming Lei wrote:
> 
> If new NS needn't to be considered, the issue is easier to deal with updating
> nr_hw_queues, such as:
> 
> - disable scheduler in 1st iterating tag_list
> 
> - update nr_hw_queues in 2nd iterating tag_list
> 
> - restore scheduler in 3rd iterating tag_list
> 
> ->tag_list_lock is acquired & released in each iteration.
> 
> Then per-set lock isn't needed any more.
> 
Hmm still thinking...
>>>
>>> Please see the attached patch which treats elv_iosched_store() as reader.
>>>
>> I looked trough patch and looks good. However we still have an issue 
>> in code paths where we acquire ->elevator_lock without first freezing 
>> the queue.In this case if we try allocate memory using GFP_KERNEL 
>> then fs_reclaim would still trigger lockdep splat. As for this case
>> the lock order would be,
>>
>> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)
> 
> That is why I call ->elevator_lock is used too much, especially it is
> depended for dealing with kobject & sysfs, which isn't needed in this way.
> 
>>
>> In fact, I tested your patch and got into the above splat. I've 
>> attached lockdep splat for your reference. So IMO, we should instead
>> try make locking order such that ->freeze_lock shall never depend on
>> ->elevator_lock. It seems one way to make that possible if we make
>> ->elevator_lock per-set. 
> 
> The attached patch is just for avoiding race between add/del disk vs.
> updating nr_hw_queues, and it should have been for covering race between
> adding/exiting request queue vs. updating nr_hw_queues.
> 
Okay understood.

> If re-order can be done, that is definitely good, but...
Yeah, so I tried re-ordering the locks so that we first grab
q->elevator_lock and then ->freeze_lock. As we know, there's a challenge
with re-arranging the lock order in __blk_mq_update_nr_hw_queues, but I
managed to rectify it. I added one more tag_list iteration where we first
acquire ->elevator_lock, and then in the next tag_list iteration we
acquire ->freeze_lock. I have also updated this lock order at the other
call sites.

Then, as we have already discussed, there's another challenge
re-arranging the lock order in del_gendisk; however, I managed to
mitigate that by moving elevator_exit and elv_unregister_queue (from
blk_unregister_queue) to just after we delete the queue tag set (or in
other words, when we remove the queue from the tag list) in del_gendisk.
That means we can now safely invoke elv_unregister_queue and
elevator_exit from del_gendisk without needing to acquire
->elevator_lock.

For reference, I attached an (informal) patch. Yes, I know this patch
would not fix all splats, but at least we would stop observing the splats
related to the ->freeze_lock dependency on ->elevator_lock.
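
Roughly, the reordered locking I have in mind looks like this (informal
sketch only; I'm ignoring the lock-class nesting annotations that taking
several ->elevator_lock instances in a row would need, and the
*_nomemsave/_nomemrestore helpers stand for the per-queue freeze and
unfreeze variants):

	/* in blk_mq_update_nr_hw_queues(), under ->tag_list_lock */
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		mutex_lock(&q->elevator_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue_nomemsave(q);

	/* ... realloc hctxs, update queue map ... */

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue_nomemrestore(q);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		mutex_unlock(&q->elevator_lock);

so ->elevator_lock is always taken before ->freeze_lock and never the
other way round.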

> 
>>
>>>>>
>>>>> Actually freeze lock is already held for nvme before calling
>>>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
>>>>> frozen for updating nr_hw_queues, so the above order may not match
>>>>> with the existed code.
>>>>>
>>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>>>>>
>>>> I think we should consider (may be in different patch) updating
>>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
>>>> its dependency on ->tag_list_lock.
>>>
>>> If we need to take nvme into account, the above lock order doesn't work,
>>> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
>>> and elevator lock still depends on freeze lock.
>>>
>>> If it needn't to be considered, per-set lock becomes necessary.
>> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues
>> we shall also update nvme pci, rdma and tcp drivers so that those 
>> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues
>> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that
>> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues.
> 
> This way you cause new deadlock or new trouble if you reply on freeze in
> blk_mq_update_nr_hw_queues():
> 
>  ->tag_list_lock
>  	freeze_lock
> 
> If timeout or io failure happens during the above freeze_lock, timeout
> handler can not move on because new blk_mq_update_nr_hw_queues() can't
> grab the lock.
> 
> Either deadlock or device has to been removed.
> 
With this newly attached patch, do you still foresee this issue? Yes,
this patch doesn't change anything in the nvme driver, but later we may
update the nvme driver so that it doesn't require freezing the queue
before invoking blk_mq_update_nr_hw_queues. I think this is required so
that we follow the same lock order in all call paths wrt ->elevator_lock
and ->freeze_lock.

>>> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
>>> host wide event, so either lock or srcu sync is needed.
>> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel 
>> with del_gendisk or blk_unregister_queue.
> 
> Then the per-set lock is required in both add/del disk, then how to re-order
> freeze_lock & elevator lock in del_gendisk()? 
> 
> And there is same risk with the one in blk_mq_update_nr_hw_queues().
> 
Yes, please see above where I explained how we could potentially avoid
the lock order issue in del_gendisk.
> 
> Another ways is to make sure that ->elevator_lock isn't required for dealing
> with kobject/debugfs thing, which needs to refactor elevator code.
> 
> Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs
> actually need to drain reader, that is deadlock too.
> 
I do agree. But I think let's first focus on cutting the dependency of
->freeze_lock on ->elevator_lock. Once we get past that we can address
the others.

Commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up
earlier") has now opened up a pandora's box of lockdep splats :)
Anyway, it's good that we can now catch these issues early on. In
general, I feel this change will now surface early any issue where
->freeze_lock depends on another lock.

Thanks,
--Nilay
Ming Lei April 10, 2025, 2:23 p.m. UTC | #21
On Thu, Apr 10, 2025 at 07:06:23PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/10/25 7:40 AM, Ming Lei wrote:
> > 
> > If new NS needn't to be considered, the issue is easier to deal with updating
> > nr_hw_queues, such as:
> > 
> > - disable scheduler in 1st iterating tag_list
> > 
> > - update nr_hw_queues in 2nd iterating tag_list
> > 
> > - restore scheduler in 3rd iterating tag_list
> > 
> > ->tag_list_lock is acquired & released in each iteration.
> > 
> > Then per-set lock isn't needed any more.
> > 
> Hmm still thinking...
> >>>
> >>> Please see the attached patch which treats elv_iosched_store() as reader.
> >>>
> >> I looked trough patch and looks good. However we still have an issue 
> >> in code paths where we acquire ->elevator_lock without first freezing 
> >> the queue.In this case if we try allocate memory using GFP_KERNEL 
> >> then fs_reclaim would still trigger lockdep splat. As for this case
> >> the lock order would be,
> >>
> >> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)
> > 
> > That is why I call ->elevator_lock is used too much, especially it is
> > depended for dealing with kobject & sysfs, which isn't needed in this way.
> > 
> >>
> >> In fact, I tested your patch and got into the above splat. I've 
> >> attached lockdep splat for your reference. So IMO, we should instead
> >> try make locking order such that ->freeze_lock shall never depend on
> >> ->elevator_lock. It seems one way to make that possible if we make
> >> ->elevator_lock per-set. 
> > 
> > The attached patch is just for avoiding race between add/del disk vs.
> > updating nr_hw_queues, and it should have been for covering race between
> > adding/exiting request queue vs. updating nr_hw_queues.
> > 
> Okay understood.
> 
> > If re-order can be done, that is definitely good, but...
> Yeah so I tried re-ordering locks so that we first grab q->elevator_lock 
> and then ->freeze_lock. As we know there's a challenge with re-arranging
> the lock order in __blk_mq_update_nr_hw_queues, but I managed to rectify 
> the lock order. I added one more tag_list iterator where we first acquire
> ->elevator lock and then in the next tag_list iterator we acquire 
> ->freeze_lock. I have also updated this lock order at other call sites.
> 
> Then as we have already discussed, there's also another challenge re-arranging
> the lock order in del_gendisk, however, I managed to mitigate that by moving 
> elevator_exit and elv_unregister_queue (from blk_unregister_queue)  just after
> we delete queue tag set (or in another words when remove the queue from the 
> tag-list) in del_gendisk. So that means that we could now safely invoke 
> elv_unregister_queue and elevator_exit from  del_gendisk without needing to
> acquire ->elevator_lock.
> 
> For reference, I attached a (informal) patch. Yes I know this patch would not
> fix all splats. But at-least we would stop observing splat related to 
> ->frezze_lock dependency on ->elevator_lcok. 

I just sent out the whole patchset, which basically does one thing: move
kobject & debugfs & cpuhp out of queue freezing. Then no lockdep splat is
observed in my test VM after running blktests with './check block/'.

So the point is _not_ related to the elevator lock or its order with the
freeze lock. What actually matters is "do not connect the freeze lock
with other subsystems (debugfs, sysfs, cpuhp, ...)", because freeze_lock
relies on fs_reclaim directly since commit ffa1e7ada456, and other
subsystems easily depend on fs_reclaim again.

I did have patches following your idea of reordering the elevator lock
vs. the freeze lock, but it didn't make a difference, even after I killed
most of the elevator_lock usage.

Finally I realized the above point.
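
The general shape is roughly the following (illustrative only, not the
exact code from the patchset; elv_register_queue() just stands for the
sysfs/debugfs registration work):

	/* before: kobject/sysfs work done inside the frozen section */
	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->elevator_lock);
	elv_register_queue(q, true);	/* kobject_add() may allocate */
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

	/* after: only the real switch work runs frozen */
	memflags = blk_mq_freeze_queue(q);
	/* attach/detach scheduler, tags, etc. */
	blk_mq_unfreeze_queue(q, memflags);

	elv_register_queue(q, true);	/* allocations happen unfrozen */

i.e. anything that can end up in fs_reclaim (sysfs, debugfs, cpuhp setup)
stays outside the freeze.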

> 
> > 
> >>
> >>>>>
> >>>>> Actually freeze lock is already held for nvme before calling
> >>>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
> >>>>> frozen for updating nr_hw_queues, so the above order may not match
> >>>>> with the existed code.
> >>>>>
> >>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
> >>>>>
> >>>> I think we should consider (may be in different patch) updating
> >>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
> >>>> its dependency on ->tag_list_lock.
> >>>
> >>> If we need to take nvme into account, the above lock order doesn't work,
> >>> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
> >>> and elevator lock still depends on freeze lock.
> >>>
> >>> If it needn't to be considered, per-set lock becomes necessary.
> >> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues
> >> we shall also update nvme pci, rdma and tcp drivers so that those 
> >> drivers neither freeze queue prior to invoking blk_mq_update_nr_hw_queues
> >> nor unfreeze queue after blk_mq_update_nr_hw_queues returns. I see that
> >> nvme loop and fc don't freeze queue prior to invoking blk_mq_update_nr_hw_queues.
> > 
> > This way you cause new deadlock or new trouble if you reply on freeze in
> > blk_mq_update_nr_hw_queues():
> > 
> >  ->tag_list_lock
> >  	freeze_lock
> > 
> > If timeout or io failure happens during the above freeze_lock, timeout
> > handler can not move on because new blk_mq_update_nr_hw_queues() can't
> > grab the lock.
> > 
> > Either deadlock or device has to been removed.
> > 
> With this new attached patch do you still foresee this issue? Yes this patch 
> doesn't change anything with nvme driver, but later we may update nvme driver
> so that nvme driver doesn't require freezing queue before invoking blk_mq_
> update_nr_hw_queues. I think this is requires so that we follow the same lock
> order in all call paths wrt ->elevator_lock and ->freeze_lock.

The nvme freeze lock is acquired non_owner, and it can't be verified by
lockdep now, so don't use nvme for running this lock test.

> 
> >>> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
> >>> host wide event, so either lock or srcu sync is needed.
> >> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel 
> >> with del_gendisk or blk_unregister_queue.
> > 
> > Then the per-set lock is required in both add/del disk, then how to re-order
> > freeze_lock & elevator lock in del_gendisk()? 
> > 
> > And there is same risk with the one in blk_mq_update_nr_hw_queues().
> > 
> Yes please see above as I explained how we could potentially avoid lock order 
> issue in del_gendisk.
> > 
> > Another ways is to make sure that ->elevator_lock isn't required for dealing
> > with kobject/debugfs thing, which needs to refactor elevator code.
> > 
> > Such as, ->elevator_lock is grabbed in debugfs handler, removing sched debugfs
> > actually need to drain reader, that is deadlock too.
> > 
> I do agree. But I think lets first focus on cutting dependency of ->freeze_lock 
> on ->elevator_lock. Once we get past it we may address other. 
> 
> This commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up 
> earlier") has now opened up pandora's box of lockdep splats :) 
> Anyways it's good that we could now catch these issues early on. In general,
> I feel that now this change would show up early on the issues where ->freeze_lock
> depends on any other locks. 

It also shows the block layer queue freeze API should be used very
carefully; fortunately we have the powerful lockdep.



Thanks,
Ming
Nilay Shroff April 10, 2025, 2:48 p.m. UTC | #22
On 4/10/25 7:53 PM, Ming Lei wrote:
> 
> I just sent out the whole patchset, which did one thing basically: move kobject
> & debugfs & cpuhp out of queue freezing, then no any lockdep is observed in my
> test VM after running blktests of './check block/'.
Sure, I think our emails crossed over. I will review your changes.
> 
> So the point is _not_ related with elevator lock or its order with
> freeze lock. What matters is actually "do not connect freeze lock with
> other subsystem(debugfs, sysfs, cpuhp, ...)", because freeze_lock relies
> on fs_reclaim directly with commit ffa1e7ada456, but other subsystem easily
> depends on fs_reclaim again.
> 
> I did have patches to follow your idea to reorder elevator lock vs. freeze
> lock, but it didn't make difference, even though after I killed most of
> elevator_lock.
> 
> Finally I realized the above point.
Ok great:)

Thanks,
--Nilay
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae8494d88897..d7a103dc258b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4465,14 +4465,12 @@  static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 	return NULL;
 }
 
-static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-						struct request_queue *q)
+static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
+				     struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i, j;
 
-	/* protect against switching io scheduler  */
-	mutex_lock(&q->elevator_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int old_node;
 		int node = blk_mq_get_hctx_node(set, i);
@@ -4505,7 +4503,19 @@  static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 	xa_for_each_start(&q->hctx_table, j, hctx, j)
 		blk_mq_exit_hctx(q, set, hctx, j);
-	mutex_unlock(&q->elevator_lock);
+}
+
+static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
+				   struct request_queue *q, bool lock)
+{
+	if (lock) {
+		/* protect against switching io scheduler  */
+		mutex_lock(&q->elevator_lock);
+		__blk_mq_realloc_hw_ctxs(set, q);
+		mutex_unlock(&q->elevator_lock);
+	} else {
+		__blk_mq_realloc_hw_ctxs(set, q);
+	}
 
 	/* unregister cpuhp callbacks for exited hctxs */
 	blk_mq_remove_hw_queues_cpuhp(q);
@@ -4537,7 +4547,7 @@  int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
-	blk_mq_realloc_hw_ctxs(set, q);
+	blk_mq_realloc_hw_ctxs(set, q, false);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
@@ -5033,7 +5043,7 @@  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q);
+		blk_mq_realloc_hw_ctxs(set, q, true);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;