diff mbox series

[2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes

Message ID 20250205144506.663819-3-nilay@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series block: fix lock order and remove redundant locking | expand

Commit Message

Nilay Shroff Feb. 5, 2025, 2:44 p.m. UTC
The sysfs attributes are already protected by the sysfs/kernfs internal
locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
attribute files. This change therefore avoids holding q->sysfs_lock while
accessing sysfs attribute files.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sysfs.c |  6 +-----
 block/blk-sysfs.c    | 10 +++-------
 2 files changed, 4 insertions(+), 12 deletions(-)

Comments

Christoph Hellwig Feb. 5, 2025, 3:53 p.m. UTC | #1
On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> The sysfs attributes are already protected with sysfs/kernfs internal
> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
> attribute files. So this change helps avoid holding q->sysfs_lock while
> accessing sysfs attribute files.

the sysfs/kernfs locking only protects against other accesses using
sysfs.  But that's not really the most interesting part here.  We
also want to make sure nothing changes underneath in a way that
could cause crashes (and maybe even torn information).

We'll really need to audit what is accessed in each method and figure
out what protects it.  Chances are that sysfs_lock provides that
protection in some case right now, and chances are also very high
that a lot of this is pretty broken.
Nilay Shroff Feb. 6, 2025, 1:54 p.m. UTC | #2
On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>> The sysfs attributes are already protected with sysfs/kernfs internal
>> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
>> attribute files. So this change helps avoid holding q->sysfs_lock while
>> accessing sysfs attribute files.
> 
> the sysfs/kernfs locking only protects against other accesses using
> sysfs.  But that's not really the most interesting part here.  We
> also want to make sure nothing changes underneath in a way that
> could cause crashes (and maybe even torn information).
> 
> We'll really need to audit what is accessed in each method and figure
> out what protects it.  Chances are that sysfs_lock provides that
> protection in some case right now, and chances are also very high
> that a lot of this is pretty broken.
> 
Yes, that's possible, so I audited all sysfs attributes currently
protected by q->sysfs_lock and found some interesting facts.
Please see below:

1. io_poll:
   Write to this attribute is ignored. So, we don't need q->sysfs_lock.

2. io_poll_delay:
   Write to this attribute is NOP, so we don't need q->sysfs_lock.

3. io_timeout:
   Write to this attribute updates q->rq_timeout and a read of this attribute
   returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is
   set only once when we init the queue (under blk_mq_init_allocated_queue()),
   even before the disk is added. So that means we may not need to protect
   it with q->sysfs_lock.

4. nomerges:
   Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
   and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
   anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
   updated/accessed with bitops which are atomic. So, I believe, protecting
   it with q->sysfs_lock is not necessary.

5. nr_requests:
   Write to this attribute updates the tag sets and this could potentially
   race with __blk_mq_update_nr_hw_queues(). So I think we should really 
   protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

6. read_ahead_kb:
   Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
   ra_pages is also updated under queue_limits_commit_update() which runs 
   holding q->limits_lock; so I think this attribute file should be protected
   with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
   Maybe we should move it under the same set of attribute files which today
   run with q->limits_lock held.

7. rq_affinity:
   Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
   and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
   using the test_bit macro. As reads/writes to q->flags use bitops, which are atomic,
   protecting it with q->sysfs_lock is not necessary.

8. scheduler:
   Write to this attribute actually updates q->elevator and the elevator change/switch 
   code expects that the q->sysfs_lock is held while we update the iosched to protect 
   against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
   q->sysfs_lock protection.

   However, if we're thinking of protecting sched change/update using q->tag_set->
   tag_list_lock (as discussed in another thread), then we may use q->tag_set->
   tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
   file.

9. wbt_lat_usec:
   Write to this attribute file updates the various wbt limits and state. This may race 
   with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through 
   blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock,
   so yes, we need to protect this attribute with q->sysfs_lock.

   However, as mentioned above, if we're thinking of protecting elevator change/update
   using q->tag_set->tag_list_lock, then we may use q->tag_set->tag_list_lock instead of
   q->sysfs_lock while reading/writing to this attribute file.

So yes, you've rightly guessed: a few of the above attributes are not well protected,
and a few still require sysfs_lock protection.

From the above list, I see that the "read_ahead_kb" should be protected with q->limits_lock.

The "nr_requests" should be protected with q->tag_set->tag_list_lock instead of q->sysfs_lock.

The "scheduler" and "wbt_lat_usec" require protection using either q->sysfs_lock or 
q->tag_set->tag_list_lock. 

The rest of the attributes don't seem to require protection from q->sysfs_lock or any
other lock, and are adequately protected by the sysfs/kernfs internal locking.

Thanks,
--Nilay
Christoph Hellwig Feb. 6, 2025, 2:07 p.m. UTC | #3
On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> Yes that's possible and so I audited all sysfs attributes which are 
> currently protected using q->sysfs_lock and I found some interesting
> facts. Please find below:
> 
> 1. io_poll:
>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> 
> 2. io_poll_delay:
>    Write to this attribute is NOP, so we don't need q->sysfs_lock.

Yes, those are easy.

> 3. io_timeout:
>    Write to this attribute updates q->rq_timeout and read of this attribute
>    returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is
>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>    even before disk is added. So that means that we may not need to protect
>    it with q->sysfs_lock.

Are we sure blk_queue_rq_timeout is never called from anything but
probe?  Either way, given that this is a limit that isn't directly
correlated with anything else, simply using WRITE_ONCE/READ_ONCE might
be enough.

> 
> 4. nomerges:
>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>    updated/accessed with bitops which are atomic. So, I believe, protecting
>    it with q->sysfs_lock is not necessary.

Yes.

> 5. nr_requests:
>    Write to this attribute updates the tag sets and this could potentially
>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

Yeah.

> 6. read_ahead_kb:
>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>    ra_pages is also updated under queue_limits_commit_update() which runs 
>    holding q->limits_lock; so I think this attribute file should be protected
>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>    Maybe we should move it under the same sets of attribute files which today
>    runs with q->limits_lock held.

Yes, limits_lock sounds sensible here.

> 7. rq_affinity:
>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>    protecting it with q->stsys_lock is not necessary.

Agreed.  Although updating both flags isn't atomic that should be
harmless in this case (but could use a comment about that).

> 8. scheduler:
>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>    q->sysfs_lock protection.
> 
>    However if we're thinking of protecting sched change/update using q->tag_sets->
>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>    file.

Yes.

> 9. wbt_lat_usec:
>    Write to this attribute file updates the various wbt limits and state. This may race 
>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>    blk_mq_exit_sched() and blk_register_queue() is currently protected with q->sysfs_lock
>    and so yes, we need to protect this attribute with q->sysfs_lock.
> 
>    However, as mentioned above, if we're thinking of protecting elevator change/update
>    using q->sets->tag_list_lock then we may use q->tag_set->tag_list_lock intstead of
>    q->sysfs_lock while reading/writing to this attribute file.

Yes.

> The rest of attributes seems don't require protection from q->sysfs_lock or any other lock and 
> those could be well protected with the sysfs/kernfs internal lock.

So I think the first step here is to remove the locking from
queue_attr_store/queue_attr_show and move it into the attributes that
still need it in some form, followed by replacing it with the more
suitable locks as defined above.  And maybe with a little bit more
work we can remove sysfs_lock entirely eventually.
Nilay Shroff Feb. 7, 2025, 11:03 a.m. UTC | #4
On 2/6/25 7:37 PM, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>> Yes that's possible and so I audited all sysfs attributes which are 
>> currently protected using q->sysfs_lock and I found some interesting
>> facts. Please find below:
>>
>> 1. io_poll:
>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> 
> Yes, those are easy.
> 
>> 3. io_timeout:
>>    Write to this attribute updates q->rq_timeout and read of this attribute
>>    returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is
>>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>>    even before disk is added. So that means that we may not need to protect
>>    it with q->sysfs_lock.
> 
> Are we sure blk_queue_rq_timeout is never called from anything but
> probe?  Either way given that this is a limit that isn't directly
> corelated with anything else simply using WRITE_ONCE/READ_ONCE might
> be enough.
I grepped the source code again and confirmed that blk_queue_rq_timeout is
mostly called from driver probe methods (barring the nbd driver, which may
set q->rq_timeout using an ioctl). And yes, agreed, q->rq_timeout can be
accessed using WRITE_ONCE/READ_ONCE.

>>
>> 4. nomerges:
>>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>    it with q->sysfs_lock is not necessary.
> 
> Yes.
> 
>> 5. nr_requests:
>>    Write to this attribute updates the tag sets and this could potentially
>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> 
> Yeah.
> 
>> 6. read_ahead_kb:
>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>    holding q->limits_lock; so I think this attribute file should be protected
>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>    Maybe we should move it under the same sets of attribute files which today
>>    runs with q->limits_lock held.
> 
> Yes, limits_lock sounds sensible here.
> 
>> 7. rq_affinity:
>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>>    protecting it with q->stsys_lock is not necessary.
> 
> Agreed.  Although updating both flags isn't atomic that should be
> harmless in this case (but could use a comment about that).
Sure, I will add a comment about this.
> 
>> 8. scheduler:
>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>    q->sysfs_lock protection.
>>
>>    However if we're thinking of protecting sched change/update using q->tag_sets->
>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>    file.
> 
> Yes.
> 
>> 9. wbt_lat_usec:
>>    Write to this attribute file updates the various wbt limits and state. This may race 
>>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>>    blk_mq_exit_sched() and blk_register_queue() is currently protected with q->sysfs_lock
>>    and so yes, we need to protect this attribute with q->sysfs_lock.
>>
>>    However, as mentioned above, if we're thinking of protecting elevator change/update
>>    using q->sets->tag_list_lock then we may use q->tag_set->tag_list_lock intstead of
>>    q->sysfs_lock while reading/writing to this attribute file.
> 
> Yes.
> 
>> The rest of attributes seems don't require protection from q->sysfs_lock or any other lock and 
>> those could be well protected with the sysfs/kernfs internal lock.
> 
> So I think the first step here is to remove the locking from
> queue_attr_store/queue_attr_show and move it into the attributes that
> still need it in some form, followed by replacing it with the more
> suitable locks as defined above.  And maybe with a little bit more
> work we can remove sysfs_lock entirely eventually.
Yes, this sounds like a good plan!

Thank you Christoph for your review! As you suggested in another thread,
I'll wait some more in case Jens and Ming have any further comments about
this change.

Thanks,
--Nilay
Ming Lei Feb. 8, 2025, 10:41 a.m. UTC | #5
On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> > On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> >> The sysfs attributes are already protected with sysfs/kernfs internal
> >> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
> >> attribute files. So this change helps avoid holding q->sysfs_lock while
> >> accessing sysfs attribute files.
> > 
> > the sysfs/kernfs locking only protects against other accesses using
> > sysfs.  But that's not really the most interesting part here.  We
> > also want to make sure nothing changes underneath in a way that
> > could cause crashes (and maybe even torn information).
> > 
> > We'll really need to audit what is accessed in each method and figure
> > out what protects it.  Chances are that sysfs_lock provides that
> > protection in some case right now, and chances are also very high
> > that a lot of this is pretty broken.
> > 
> Yes that's possible and so I audited all sysfs attributes which are 
> currently protected using q->sysfs_lock and I found some interesting
> facts. Please find below:
> 
> 1. io_poll:
>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> 
> 2. io_poll_delay:
>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> 
> 3. io_timeout:
>    Write to this attribute updates q->rq_timeout and read of this attribute
>    returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is
>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>    even before disk is added. So that means that we may not need to protect
>    it with q->sysfs_lock.
> 
> 4. nomerges:
>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>    updated/accessed with bitops which are atomic. So, I believe, protecting
>    it with q->sysfs_lock is not necessary.
> 
> 5. nr_requests:
>    Write to this attribute updates the tag sets and this could potentially
>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> 
> 6. read_ahead_kb:
>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>    ra_pages is also updated under queue_limits_commit_update() which runs 
>    holding q->limits_lock; so I think this attribute file should be protected
>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>    Maybe we should move it under the same sets of attribute files which today
>    runs with q->limits_lock held.
> 
> 7. rq_affinity:
>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>    protecting it with q->stsys_lock is not necessary.
> 
> 8. scheduler:
>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>    q->sysfs_lock protection.
> 
>    However if we're thinking of protecting sched change/update using q->tag_sets->
>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>    file.

This is a misuse of tag_list_lock, which is supposed to cover host-wide
changes and shouldn't be used for request-queue-level protection, which
is exactly what q->sysfs_lock provides.

Not to mention it will cause an ABBA deadlock over the freeze lock; please
see blk_mq_update_nr_hw_queues(). And it can't be used for protecting
'nr_requests' either.

> 
> 9. wbt_lat_usec:
>    Write to this attribute file updates the various wbt limits and state. This may race 
>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>    blk_mq_exit_sched() and blk_register_queue() is currently protected with q->sysfs_lock
>    and so yes, we need to protect this attribute with q->sysfs_lock.
> 
>    However, as mentioned above, if we're thinking of protecting elevator change/update
>    using q->sets->tag_list_lock then we may use q->tag_set->tag_list_lock intstead of
>    q->sysfs_lock while reading/writing to this attribute file.

Same ABBA deadlock as above.


Thanks,
Ming
Nilay Shroff Feb. 8, 2025, 12:56 p.m. UTC | #6
On 2/8/25 4:11 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>> The sysfs attributes are already protected with sysfs/kernfs internal
>>>> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
>>>> attribute files. So this change helps avoid holding q->sysfs_lock while
>>>> accessing sysfs attribute files.
>>>
>>> the sysfs/kernfs locking only protects against other accesses using
>>> sysfs.  But that's not really the most interesting part here.  We
>>> also want to make sure nothing changes underneath in a way that
>>> could cause crashes (and maybe even torn information).
>>>
>>> We'll really need to audit what is accessed in each method and figure
>>> out what protects it.  Chances are that sysfs_lock provides that
>>> protection in some case right now, and chances are also very high
>>> that a lot of this is pretty broken.
>>>
>> Yes that's possible and so I audited all sysfs attributes which are 
>> currently protected using q->sysfs_lock and I found some interesting
>> facts. Please find below:
>>
>> 1. io_poll:
>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
>>
>> 3. io_timeout:
>>    Write to this attribute updates q->rq_timeout and read of this attribute
>>    returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is
>>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>>    even before disk is added. So that means that we may not need to protect
>>    it with q->sysfs_lock.
>>
>> 4. nomerges:
>>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>    it with q->sysfs_lock is not necessary.
>>
>> 5. nr_requests:
>>    Write to this attribute updates the tag sets and this could potentially
>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>
>> 6. read_ahead_kb:
>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>    holding q->limits_lock; so I think this attribute file should be protected
>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>    Maybe we should move it under the same sets of attribute files which today
>>    runs with q->limits_lock held.
>>
>> 7. rq_affinity:
>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>>    protecting it with q->stsys_lock is not necessary.
>>
>> 8. scheduler:
>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>    q->sysfs_lock protection.
>>
>>    However if we're thinking of protecting sched change/update using q->tag_sets->
>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>    file.
> 
> This is one misuse of tag_list_lock, which is supposed to cover host
> wide change, and shouldn't be used for request queue level protection,
> which is exactly provided by q->sysfs_lock.
> 
Yes, I think Christoph also pointed out the same, but assuming a
scheduler/elevator update would be a rare operation, it may not cause
a lot of contention. Having said that, I'm also fine with creating another
lock just to protect elevator changes and removing ->sysfs_lock from
the elevator code.

> Not mention it will cause ABBA deadlock over freeze lock, please see
> blk_mq_update_nr_hw_queues(). And it can't be used for protecting
> 'nr_requests' too.
I don't know how this might cause an ABBA deadlock. The proposal here is to
use ->tag_list_lock (instead of ->sysfs_lock) while updating the scheduler
attribute from sysfs, as well as while we update the elevator through
__blk_mq_update_nr_hw_queues().

In each code path (either the sysfs attribute update or the nr_hw_queues
update), we first acquire ->tag_list_lock and then the freeze lock.

Do you see any code path where the above order might not be followed?  	

Thanks,
--Nilay
Ming Lei Feb. 9, 2025, 11:41 a.m. UTC | #7
On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/8/25 4:11 PM, Ming Lei wrote:
> > On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> >>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> >>>> The sysfs attributes are already protected with sysfs/kernfs internal
> >>>> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
> >>>> attribute files. So this change helps avoid holding q->sysfs_lock while
> >>>> accessing sysfs attribute files.
> >>>
> >>> the sysfs/kernfs locking only protects against other accesses using
> >>> sysfs.  But that's not really the most interesting part here.  We
> >>> also want to make sure nothing changes underneath in a way that
> >>> could cause crashes (and maybe even torn information).
> >>>
> >>> We'll really need to audit what is accessed in each method and figure
> >>> out what protects it.  Chances are that sysfs_lock provides that
> >>> protection in some case right now, and chances are also very high
> >>> that a lot of this is pretty broken.
> >>>
> >> Yes that's possible and so I audited all sysfs attributes which are 
> >> currently protected using q->sysfs_lock and I found some interesting
> >> facts. Please find below:
> >>
> >> 1. io_poll:
> >>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> >>
> >> 2. io_poll_delay:
> >>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> >>
> >> 3. io_timeout:
> >>    Write to this attribute updates q->rq_timeout and read of this attribute
> >>    returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is
> >>    set only once when we init the queue (under blk_mq_init_allocated_queue())
> >>    even before disk is added. So that means that we may not need to protect
> >>    it with q->sysfs_lock.
> >>
> >> 4. nomerges:
> >>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
> >>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
> >>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
> >>    updated/accessed with bitops which are atomic. So, I believe, protecting
> >>    it with q->sysfs_lock is not necessary.
> >>
> >> 5. nr_requests:
> >>    Write to this attribute updates the tag sets and this could potentially
> >>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
> >>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> >>
> >> 6. read_ahead_kb:
> >>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
> >>    ra_pages is also updated under queue_limits_commit_update() which runs 
> >>    holding q->limits_lock; so I think this attribute file should be protected
> >>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
> >>    Maybe we should move it under the same sets of attribute files which today
> >>    runs with q->limits_lock held.
> >>
> >> 7. rq_affinity:
> >>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
> >>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
> >>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
> >>    protecting it with q->stsys_lock is not necessary.
> >>
> >> 8. scheduler:
> >>    Write to this attribute actually updates q->elevator and the elevator change/switch 
> >>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
> >>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
> >>    q->sysfs_lock protection.
> >>
> >>    However if we're thinking of protecting sched change/update using q->tag_sets->
> >>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
> >>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
> >>    file.
> > 
> > This is one misuse of tag_list_lock, which is supposed to cover host
> > wide change, and shouldn't be used for request queue level protection,
> > which is exactly provided by q->sysfs_lock.
> > 
> Yes I think Christoph was also pointed about the same but then assuming 
> schedule/elevator update would be a rare operation it may not cause
> a lot of contention. Having said that, I'm also fine creating another 
> lock just to protect elevator changes and removing ->sysfs_lock from 
> elevator code.

Then please use a new lock.

> 
> > Not mention it will cause ABBA deadlock over freeze lock, please see
> > blk_mq_update_nr_hw_queues(). And it can't be used for protecting
> > 'nr_requests' too.
> I don't know how this might cause ABBA deadlock. The proposal here's to 
> use ->tag_list_lock (instead of ->sysfs_lock) while updating scheduler 
> attribute from sysfs as well as while we update the elevator through 
> __blk_mq_update_nr_hw_queues().
> 
> In each code path (either from sysfs attribute update or from nr_hw_queues 
> update), we first acquire ->tag_list_lock and then freeze-lock.
> 
> Do you see any code path where the above order might not be followed?  	

Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk addition/removal")
added one such warning: blk_mq_sysfs_unregister() is called after the
queue freeze lock is grabbed in del_gendisk().

Also there are many such use cases in nvme: blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset()
are called after the tagset is frozen.

More seriously, a driver may grab ->tag_list_lock in its error recovery code
to provide forward progress, so you have to be careful wrt. using ->tag_list_lock,
for example:

	mutex_lock(->tag_list_lock)
	blk_mq_freeze_queue()		// If an IO timeout happens, the driver timeout
					// handler gets stuck on mutex_lock(->tag_list_lock)


Thanks,
Ming
Nilay Shroff Feb. 9, 2025, 1:41 p.m. UTC | #8
On 2/9/25 5:11 PM, Ming Lei wrote:
> On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/8/25 4:11 PM, Ming Lei wrote:
>>> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>>>> The sysfs attributes are already protected with sysfs/kernfs internal
>>>>>> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs
>>>>>> attribute files. So this change helps avoid holding q->sysfs_lock while
>>>>>> accessing sysfs attribute files.
>>>>>
>>>>> the sysfs/kernfs locking only protects against other accesses using
>>>>> sysfs.  But that's not really the most interesting part here.  We
>>>>> also want to make sure nothing changes underneath in a way that
>>>>> could cause crashes (and maybe even torn information).
>>>>>
>>>>> We'll really need to audit what is accessed in each method and figure
>>>>> out what protects it.  Chances are that sysfs_lock provides that
>>>>> protection in some case right now, and chances are also very high
>>>>> that a lot of this is pretty broken.
>>>>>
>>>> Yes that's possible and so I audited all sysfs attributes which are 
>>>> currently protected using q->sysfs_lock and I found some interesting
>>>> facts. Please find below:
>>>>
>>>> 1. io_poll:
>>>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>>>
>>>> 2. io_poll_delay:
>>>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
>>>>
>>>> 3. io_timeout:
>>>>    Write to this attribute updates q->rq_timeout and read of this attribute
>>>>    returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is
>>>>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>>>>    even before disk is added. So that means that we may not need to protect
>>>>    it with q->sysfs_lock.
>>>>
>>>> 4. nomerges:
>>>>    Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES 
>>>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>>>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>>>    it with q->sysfs_lock is not necessary.
>>>>
>>>> 5. nr_requests:
>>>>    Write to this attribute updates the tag sets and this could potentially
>>>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>>>
>>>> 6. read_ahead_kb:
>>>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>>>    holding q->limits_lock; so I think this attribute file should be protected
>>>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>>>    Maybe we should move it under the same set of attribute files which today
>>>>    run with q->limits_lock held.
>>>>
>>>> 7. rq_affinity:
>>>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>>>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>>>>    protecting it with q->sysfs_lock is not necessary.
>>>>
>>>> 8. scheduler:
>>>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>>>    q->sysfs_lock protection.
>>>>
>>>>    However if we're thinking of protecting sched change/update using q->tag_sets->
>>>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>>>    file.
>>>
>>> This is one misuse of tag_list_lock, which is supposed to cover host
>>> wide change, and shouldn't be used for request queue level protection,
>>> which is exactly provided by q->sysfs_lock.
>>>
>> Yes, I think Christoph also pointed out the same, but assuming a 
>> scheduler/elevator update is a rare operation, it may not cause
>> a lot of contention. Having said that, I'm also fine with creating another 
>> lock just to protect elevator changes and removing ->sysfs_lock from the 
>> elevator code.
> 
> Then please use a new lock.
Okay, I will replace q->sysfs_lock with another dedicated lock for synchronizing
elevator switch and nr_hw_queues update; that would help eliminate the dependency 
between q->q_usage_counter(io) (the freeze-lock) and q->sysfs_lock. 
> 
>>
>>> Not to mention it will cause an ABBA deadlock over the freeze lock; please see
>>> blk_mq_update_nr_hw_queues(). And it can't be used for protecting
>>> 'nr_requests' too.
>> I don't know how this might cause an ABBA deadlock. The proposal here is to 
>> use ->tag_list_lock (instead of ->sysfs_lock) while updating scheduler 
>> attribute from sysfs as well as while we update the elevator through 
>> __blk_mq_update_nr_hw_queues().
>>
>> In each code path (either from sysfs attribute update or from nr_hw_queues 
>> update), we first acquire ->tag_list_lock and then freeze-lock.
>>
>> Do you see any code path where the above order might not be followed?  	
> 
> Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk addition/removal")
> has added one such warning: blk_mq_sysfs_unregister() is called after the
> queue freeze lock is grabbed in del_gendisk().
> 
> Also there are many such use cases in nvme: blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset()
> are called after the tagset is frozen.
> 
> More seriously, a driver may grab ->tag_list_lock in its error recovery code to
> provide forward progress, so you have to be careful when using ->tag_list_lock,
> for example:
> 
> 	mutex_lock(->tag_list_lock)
> 	blk_mq_freeze_queue()		// If IO timeout happens, the driver timeout
> 					// handler stuck on mutex_lock(->tag_list_lock)

OK, got it! But let's wait a bit to see if Christoph or others have any further comments
before I start making this change.

Thanks,
--Nilay
diff mbox series

Patch

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3feeeccf8a99..da53397d99fa 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -52,7 +52,6 @@  static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	struct blk_mq_hw_ctx_sysfs_entry *entry;
 	struct blk_mq_hw_ctx *hctx;
 	struct request_queue *q;
-	ssize_t res;
 
 	entry = container_of(attr, struct blk_mq_hw_ctx_sysfs_entry, attr);
 	hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);
@@ -61,10 +60,7 @@  static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	if (!entry->show)
 		return -EIO;
 
-	mutex_lock(&q->sysfs_lock);
-	res = entry->show(hctx, page);
-	mutex_unlock(&q->sysfs_lock);
-	return res;
+	return entry->show(hctx, page);
 }
 
 static ssize_t blk_mq_hw_sysfs_nr_tags_show(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f548a4376aa..2b8e7b311c61 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -664,14 +664,11 @@  queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 {
 	struct queue_sysfs_entry *entry = to_queue(attr);
 	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
-	ssize_t res;
 
 	if (!entry->show)
 		return -EIO;
-	mutex_lock(&disk->queue->sysfs_lock);
-	res = entry->show(disk, page);
-	mutex_unlock(&disk->queue->sysfs_lock);
-	return res;
+
+	return entry->show(disk, page);
 }
 
 static ssize_t
@@ -710,11 +707,10 @@  queue_attr_store(struct kobject *kobj, struct attribute *attr,
 		return length;
 	}
 
-	mutex_lock(&q->sysfs_lock);
 	memflags = blk_mq_freeze_queue(q);
 	res = entry->store(disk, page, length);
 	blk_mq_unfreeze_queue(q, memflags);
-	mutex_unlock(&q->sysfs_lock);
+
 	return res;
 }