
[1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock

Message ID 20250205144506.663819-2-nilay@linux.ibm.com (mailing list archive)
State New, archived
Series block: fix lock order and remove redundant locking

Commit Message

Nilay Shroff Feb. 5, 2025, 2:44 p.m. UTC
Lockdep reports [1] have identified inconsistent lock ordering between
q->sysfs_lock and freeze-lock at several call sites in the block layer.
This patch resolves the issue by enforcing a consistent lock acquisition
order: q->sysfs_lock is always acquired before freeze-lock. This change
eliminates the observed lockdep splats caused by the inconsistent
ordering.
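
Condensed from the __blk_mq_update_nr_hw_queues() hunk below, the
resulting acquisition pattern looks roughly like this (the cpuhp
handling between unlock and unfreeze is omitted here):

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		mutex_lock(&q->sysfs_lock);		/* sysfs_lock first */
		blk_mq_freeze_queue_nomemsave(q);	/* ... then freeze  */
	}

	/* ... reallocate hctxs, switch elevators back, ... */

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		mutex_unlock(&q->sysfs_lock);
		blk_mq_unfreeze_queue_nomemrestore(q);
	}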

Additionally, while rearranging the locking order, we ensure that no new
lock ordering issues are introduced between the global CPU hotplug (cpuhp)
lock and q->sysfs_lock, as previously reported [2]. To address this,
blk_mq_add_hw_queues_cpuhp() and blk_mq_remove_hw_queues_cpuhp() are now
called outside the critical section protected by q->sysfs_lock.

Since blk_mq_add_hw_queues_cpuhp() and blk_mq_remove_hw_queues_cpuhp()
are invoked during hardware context allocation via
blk_mq_realloc_hw_ctxs(), which runs holding q->sysfs_lock, we've
relocated the add/remove cpuhp function calls to
__blk_mq_update_nr_hw_queues() and blk_mq_init_allocated_queue() after
the q->sysfs_lock is released. This ensures proper lock ordering without
introducing regressions.

[1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/
[2] https://lore.kernel.org/all/20241206082202.949142-1-ming.lei@redhat.com/

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq.c   | 49 ++++++++++++++++++++++++++++++++----------------
 block/elevator.c |  9 +++++++++
 2 files changed, 42 insertions(+), 16 deletions(-)

Comments

Christoph Hellwig Feb. 5, 2025, 3:59 p.m. UTC | #1
On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>  
>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  		return;
>  
>  	memflags = memalloc_noio_save();
> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		mutex_lock(&q->sysfs_lock);

This now means we hold up to as many sysfs_locks as there are request
queues at the same time.  I doubt lockdep will be happy about this.
Did you test this patch with a multi-namespace nvme device or
a multi-LU per host SCSI setup?

I suspect the answer here is to (ab)use the tag_list_lock for
scheduler updates - while the scope is too broad for just
changing it on a single queue, it is a rare operation and it
solves the mess in __blk_mq_update_nr_hw_queues.
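
For reference, the wrapper around __blk_mq_update_nr_hw_queues() already
serializes on that lock; upstream it looks roughly like:

	void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
					int nr_hw_queues)
	{
		mutex_lock(&set->tag_list_lock);
		__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
		mutex_unlock(&set->tag_list_lock);
	}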
Nilay Shroff Feb. 6, 2025, 1:22 p.m. UTC | #2
On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>  
>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  		return;
>>  
>>  	memflags = memalloc_noio_save();
>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		mutex_lock(&q->sysfs_lock);
> 
> This now means we hold up to number of request queues sysfs_lock
> at the same time.  I doubt lockdep will be happy about this.
> Did you test this patch with a multi-namespace nvme device or
> a multi-LU per host SCSI setup?
> 
Yeah, I tested with a multi-namespace NVMe disk and lockdep didn't
complain. Agreed, we need to hold q->sysfs_lock for multiple
request queues at the same time and that may not be elegant, but
looking at the mess in __blk_mq_update_nr_hw_queues we may not
have any other choice that would help correct the lock order.

> I suspect the answer here is to (ab)use the tag_list_lock for
> scheduler updates - while the scope is too broad for just
> changing it on a single queue it is a rate operation and it
> solves the mess in __blk_mq_update_nr_hw_queues.
> 
Yes, this is probably a good idea: instead of using q->sysfs_lock
we may depend on q->tag_set->tag_list_lock here for sched/elevator updates,
given that __blk_mq_update_nr_hw_queues already runs with tag_list_lock
held. But then it also requires using the same tag_list_lock instead of
the current sysfs_lock while we update the scheduler from sysfs. But that's
a trivial change.
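
Something along these lines (untested sketch only, not part of this
patch), with elv_iosched_store() taking the set-wide lock instead:

	ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
				  size_t count)
	{
		char elevator_name[ELV_NAME_MAX];
		struct request_queue *q = disk->queue;
		int ret;

		strscpy(elevator_name, buf, sizeof(elevator_name));

		/* serialize elevator switch against __blk_mq_update_nr_hw_queues */
		mutex_lock(&q->tag_set->tag_list_lock);
		ret = elevator_change(q, strstrip(elevator_name));
		mutex_unlock(&q->tag_set->tag_list_lock);

		if (!ret)
			return count;
		return ret;
	}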

Thanks,
--Nilay
Christoph Hellwig Feb. 6, 2025, 2:15 p.m. UTC | #3
On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> complain. Agreed we need to hold up q->sysfs_lock for multiple 
> request queues at the same time and that may not be elegant, but 
> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> have other choice which could help correct the lock order.

Odd, as it's usually very unhappy about nesting locks of the
same kind unless specifically annotated.

> Yes this is probably a good idea, that instead of using q->sysfs_lock 
> we may depend on q->tag_set->tag_list_lock here for sched/elevator updates
> as a fact that __blk_mq_update_nr_hw_queues already runs with tag_list_lock
> held.

Yes.

> But then it also requires using the same tag_list_lock instead of 
> current sysfs_lock while we update the scheduler from sysfs. But that's
> a trivial change.

Yes.  I think it's a good idea, but maybe wait a bit to see if Jens
or Ming also have opinions on this before starting the work.
Ming Lei Feb. 7, 2025, 11:59 a.m. UTC | #4
On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> > On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
> >>  
> >>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>  		return;
> >>  
> >>  	memflags = memalloc_noio_save();
> >> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> >> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >> +		mutex_lock(&q->sysfs_lock);
> > 
> > This now means we hold up to number of request queues sysfs_lock
> > at the same time.  I doubt lockdep will be happy about this.
> > Did you test this patch with a multi-namespace nvme device or
> > a multi-LU per host SCSI setup?
> > 
> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> complain. Agreed we need to hold up q->sysfs_lock for multiple 
> request queues at the same time and that may not be elegant, but 
> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> have other choice which could help correct the lock order.

All q->sysfs_lock instances actually share the same lock class, so this
should have triggered a double-lock warning; please see mutex_init().
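
For reference, mutex_init() uses a single static lock class key per call
site (include/linux/mutex.h, roughly):

	#define mutex_init(mutex)					\
	do {								\
		static struct lock_class_key __key;			\
									\
		__mutex_init((mutex), #mutex, &__key);			\
	} while (0)

so every q->sysfs_lock initialized from the same place belongs to one and
the same lock class.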

The ->sysfs_lock involved in this patch looks to be only for syncing the
elevator switch with reallocating hctxs, so I am wondering why not add a
new dedicated lock for this purpose only?

Then we needn't worry about its dependency with q->q_usage_counter(io)?
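
A rough sketch of that idea (the lock name below is invented here, not
part of this patch):

	/* hypothetical per-queue lock dedicated to elevator switch vs. hctx realloc */
	mutex_lock(&q->elevator_lock);
	elevator_switch(q, t);
	mutex_unlock(&q->elevator_lock);

so q->sysfs_lock would no longer pick up a dependency on
q->q_usage_counter(io) at all.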

Thanks,
Ming
Nilay Shroff Feb. 7, 2025, 6:02 p.m. UTC | #5
On 2/7/25 5:29 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>>>  
>>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>  		return;
>>>>  
>>>>  	memflags = memalloc_noio_save();
>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>> +		mutex_lock(&q->sysfs_lock);
>>>
>>> This now means we hold up to number of request queues sysfs_lock
>>> at the same time.  I doubt lockdep will be happy about this.
>>> Did you test this patch with a multi-namespace nvme device or
>>> a multi-LU per host SCSI setup?
>>>
>> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
>> complain. Agreed we need to hold up q->sysfs_lock for multiple 
>> request queues at the same time and that may not be elegant, but 
>> looking at the mess in __blk_mq_update_nr_hw_queues we may not
>> have other choice which could help correct the lock order.
> 
> All q->sysfs_lock instance actually shares same lock class, so this way
> should have triggered double lock warning, please see mutex_init().
> 
Well, my understanding of lockdep is that even though all q->sysfs_lock
instances share the same lock class key, lockdep differentiates locks
based on their memory address. Since each instance of &q->sysfs_lock has
a different memory address, lockdep treats each of them as a distinct lock
and, IMO, that avoids triggering the double-lock warning.

> The ->sysfs_lock involved in this patch looks only for sync elevator
> switch with reallocating hctxs, so I am wondering why not add new
> dedicated lock for this purpose only?
> 
> Then we needn't to worry about its dependency with q->q_usage_counter(io)?
> 
Yes, that should be possible, but then, as Christoph suggested,
__blk_mq_update_nr_hw_queues already runs holding tag_list_lock, so why
shouldn't we use the same tag_list_lock even for sched/elevator updates?
That way we may avoid adding another new lock.

Thanks,
--Nilay
Ming Lei Feb. 8, 2025, 8:30 a.m. UTC | #6
On Fri, Feb 07, 2025 at 11:32:37PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/7/25 5:29 PM, Ming Lei wrote:
> > On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> >>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
> >>>>  
> >>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>>>  		return;
> >>>>  
> >>>>  	memflags = memalloc_noio_save();
> >>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> >>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >>>> +		mutex_lock(&q->sysfs_lock);
> >>>
> >>> This now means we hold up to number of request queues sysfs_lock
> >>> at the same time.  I doubt lockdep will be happy about this.
> >>> Did you test this patch with a multi-namespace nvme device or
> >>> a multi-LU per host SCSI setup?
> >>>
> >> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> >> complain. Agreed we need to hold up q->sysfs_lock for multiple 
> >> request queues at the same time and that may not be elegant, but 
> >> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> >> have other choice which could help correct the lock order.
> > 
> > All q->sysfs_lock instance actually shares same lock class, so this way
> > should have triggered double lock warning, please see mutex_init().
> > 
> Well, my understanding about lockdep is that even though all q->sysfs_lock
> instances share the same lock class key, lockdep differentiates locks 
> based on their memory address. Since each instance of &q->sysfs_lock has 
> got different memory address, lockdep treat each of them as distinct locks 
> and IMO, that avoids triggering double lock warning.

That isn't correct; think about how lockdep can deal with millions of
lock instances.

Please take a look at the beginning of Documentation/locking/lockdep-design.rst

```
The validator tracks the 'usage state' of lock-classes, and it tracks
the dependencies between different lock-classes.
```

Please verify it by the following code:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e76651e786d..a4ffc6198e7b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5150,10 +5150,37 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
 		cancel_delayed_work_sync(&hctx->run_work);
 }

+struct lock_test {
+	struct mutex	lock;
+};
+
+void init_lock_test(struct lock_test *lt)
+{
+	mutex_init(&lt->lock);
+	printk("init lock: %p\n", lt);
+}
+
+static void test_lockdep(void)
+{
+	struct lock_test A, B;
+
+	init_lock_test(&A);
+	init_lock_test(&B);
+
+	printk("start lock test\n");
+	mutex_lock(&A.lock);
+	mutex_lock(&B.lock);
+	mutex_unlock(&B.lock);
+	mutex_unlock(&A.lock);
+	printk("end lock test\n");
+}
+
 static int __init blk_mq_init(void)
 {
 	int i;

+	test_lockdep();
+
 	for_each_possible_cpu(i)
 		init_llist_head(&per_cpu(blk_cpu_done, i));
 	for_each_possible_cpu(i)



Thanks,
Ming
Nilay Shroff Feb. 8, 2025, 1:18 p.m. UTC | #7
On 2/8/25 2:00 PM, Ming Lei wrote:
> On Fri, Feb 07, 2025 at 11:32:37PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/7/25 5:29 PM, Ming Lei wrote:
>>> On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
>>>>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>>>>>  
>>>>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>>  		return;
>>>>>>  
>>>>>>  	memflags = memalloc_noio_save();
>>>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>>>> +		mutex_lock(&q->sysfs_lock);
>>>>>
>>>>> This now means we hold up to number of request queues sysfs_lock
>>>>> at the same time.  I doubt lockdep will be happy about this.
>>>>> Did you test this patch with a multi-namespace nvme device or
>>>>> a multi-LU per host SCSI setup?
>>>>>
>>>> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
>>>> complain. Agreed we need to hold up q->sysfs_lock for multiple 
>>>> request queues at the same time and that may not be elegant, but 
>>>> looking at the mess in __blk_mq_update_nr_hw_queues we may not
>>>> have other choice which could help correct the lock order.
>>>
>>> All q->sysfs_lock instance actually shares same lock class, so this way
>>> should have triggered double lock warning, please see mutex_init().
>>>
>> Well, my understanding about lockdep is that even though all q->sysfs_lock
>> instances share the same lock class key, lockdep differentiates locks 
>> based on their memory address. Since each instance of &q->sysfs_lock has 
>> got different memory address, lockdep treat each of them as distinct locks 
>> and IMO, that avoids triggering double lock warning.
> 
> That isn't correct, think about how lockdep can deal with millions of
> lock instances.
> 
> Please take a look at the beginning of Documentation/locking/lockdep-design.rst
> 
> ```
> The validator tracks the 'usage state' of lock-classes, and it tracks
> the dependencies between different lock-classes.
> ```
> 
> Please verify it by the following code:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4e76651e786d..a4ffc6198e7b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5150,10 +5150,37 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
>  		cancel_delayed_work_sync(&hctx->run_work);
>  }
> 
> +struct lock_test {
> +	struct mutex	lock;
> +};
> +
> +void init_lock_test(struct lock_test *lt)
> +{
> +	mutex_init(&lt->lock);
> +	printk("init lock: %p\n", lt);
> +}
> +
> +static void test_lockdep(void)
> +{
> +	struct lock_test A, B;
> +
> +	init_lock_test(&A);
> +	init_lock_test(&B);
> +
> +	printk("start lock test\n");
> +	mutex_lock(&A.lock);
> +	mutex_lock(&B.lock);
> +	mutex_unlock(&B.lock);
> +	mutex_unlock(&A.lock);
> +	printk("end lock test\n");
> +}
> +
>  static int __init blk_mq_init(void)
>  {
>  	int i;
> 
> +	test_lockdep();
> +
>  	for_each_possible_cpu(i)
>  		init_llist_head(&per_cpu(blk_cpu_done, i));
>  	for_each_possible_cpu(i)
> 
> 
> 
Thank you Ming for providing the patch for testing lockdep!
You and Christoph were correct. Lockdep should complain about possible
recursive locking for q->sysfs_lock, and after a bit of debugging I think I
found the reason why, on my system, lockdep was unable to complain about
recursive locking. On my test system I had enabled KASAN, and KASAN reported
a potential use-after-free bug that tainted the kernel and disabled further
lock debugging. Hence any subsequent locking issues were not detected by
lockdep.

Thanks,
--Nilay

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 40490ac88045..87200539b3cc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4467,7 +4467,8 @@  static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	unsigned long i, j;
 
 	/* protect against switching io scheduler  */
-	mutex_lock(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_lock);
+
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int old_node;
 		int node = blk_mq_get_hctx_node(set, i);
@@ -4500,13 +4501,6 @@  static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 	xa_for_each_start(&q->hctx_table, j, hctx, j)
 		blk_mq_exit_hctx(q, set, hctx, j);
-	mutex_unlock(&q->sysfs_lock);
-
-	/* unregister cpuhp callbacks for exited hctxs */
-	blk_mq_remove_hw_queues_cpuhp(q);
-
-	/* register cpuhp for new initialized hctxs */
-	blk_mq_add_hw_queues_cpuhp(q);
 }
 
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
@@ -4532,10 +4526,19 @@  int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
+	mutex_lock(&q->sysfs_lock);
 	blk_mq_realloc_hw_ctxs(set, q);
+	mutex_unlock(&q->sysfs_lock);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
+	/*
+	 * Register cpuhp for new initialized hctxs and ensure that the cpuhp
+	 * registration happens outside of q->sysfs_lock to avoid any lock
+	 * ordering issue between q->sysfs_lock and global cpuhp lock.
+	 */
+	blk_mq_add_hw_queues_cpuhp(q);
+
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -4934,12 +4937,12 @@  static bool blk_mq_elv_switch_none(struct list_head *head,
 		return false;
 
 	/* q->elevator needs protection from ->sysfs_lock */
-	mutex_lock(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_lock);
 
 	/* the check has to be done with holding sysfs_lock */
 	if (!q->elevator) {
 		kfree(qe);
-		goto unlock;
+		goto out;
 	}
 
 	INIT_LIST_HEAD(&qe->node);
@@ -4949,8 +4952,7 @@  static bool blk_mq_elv_switch_none(struct list_head *head,
 	__elevator_get(qe->type);
 	list_add(&qe->node, head);
 	elevator_disable(q);
-unlock:
-	mutex_unlock(&q->sysfs_lock);
+out:
 
 	return true;
 }
@@ -4973,6 +4975,8 @@  static void blk_mq_elv_switch_back(struct list_head *head,
 	struct blk_mq_qe_pair *qe;
 	struct elevator_type *t;
 
+	lockdep_assert_held(&q->sysfs_lock);
+
 	qe = blk_lookup_qe_pair(head, q);
 	if (!qe)
 		return;
@@ -4980,11 +4984,9 @@  static void blk_mq_elv_switch_back(struct list_head *head,
 	list_del(&qe->node);
 	kfree(qe);
 
-	mutex_lock(&q->sysfs_lock);
 	elevator_switch(q, t);
 	/* drop the reference acquired in blk_mq_elv_switch_none */
 	elevator_put(t);
-	mutex_unlock(&q->sysfs_lock);
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
@@ -5006,8 +5008,10 @@  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		mutex_lock(&q->sysfs_lock);
 		blk_mq_freeze_queue_nomemsave(q);
+	}
 
 	/*
 	 * Switch IO scheduler to 'none', cleaning up the data associated
@@ -5055,8 +5059,21 @@  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_elv_switch_back(&head, q);
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		mutex_unlock(&q->sysfs_lock);
+
+		/*
+		 * Unregister cpuhp callbacks for exited hctxs and register
+		 * cpuhp for new initialized hctxs. Ensure that unregister/
+		 * register cpuhp is called outside of q->sysfs_lock to avoid
+		 * lock ordering issue between q->sysfs_lock and  global cpuhp
+		 * lock.
+		 */
+		blk_mq_remove_hw_queues_cpuhp(q);
+		blk_mq_add_hw_queues_cpuhp(q);
+
 		blk_mq_unfreeze_queue_nomemrestore(q);
+	}
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/elevator.c b/block/elevator.c
index cd2ce4921601..596eb5c0219f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -725,7 +725,16 @@  ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	int ret;
 
 	strscpy(elevator_name, buf, sizeof(elevator_name));
+
+	/*
+	 * The elevator change/switch code expects that the q->sysfs_lock
+	 * is held while we update the iosched to protect against the
+	 * simultaneous hctx update.
+	 */
+	mutex_lock(&disk->queue->sysfs_lock);
 	ret = elevator_change(disk->queue, strstrip(elevator_name));
+	mutex_unlock(&disk->queue->sysfs_lock);
+
 	if (!ret)
 		return count;
 	return ret;