Message ID | 20250205144506.663819-3-nilay@linux.ibm.com (mailing list archive)
---|---
State | New, archived
Series | block: fix lock order and remove redundant locking
On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> The sysfs attributes are already protected with sysfs/kernfs internal locking. So acquiring q->sysfs_lock is not needed while accessing sysfs attribute files. So this change helps avoid holding q->sysfs_lock while accessing sysfs attribute files.

The sysfs/kernfs locking only protects against other accesses using sysfs. But that's not really the most interesting part here. We also want to make sure nothing changes underneath in a way that could cause crashes (and maybe even torn information).

We'll really need to audit what is accessed in each method and figure out what protects it. Chances are that sysfs_lock provides that protection in some cases right now, and chances are also very high that a lot of this is pretty broken.
On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>> The sysfs attributes are already protected with sysfs/kernfs internal locking. So acquiring q->sysfs_lock is not needed while accessing sysfs attribute files. So this change helps avoid holding q->sysfs_lock while accessing sysfs attribute files.
>
> The sysfs/kernfs locking only protects against other accesses using sysfs. But that's not really the most interesting part here. We also want to make sure nothing changes underneath in a way that could cause crashes (and maybe even torn information).
>
> We'll really need to audit what is accessed in each method and figure out what protects it. Chances are that sysfs_lock provides that protection in some cases right now, and chances are also very high that a lot of this is pretty broken.
>
Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find them below:

1. io_poll:
Write to this attribute is ignored. So we don't need q->sysfs_lock.

2. io_poll_delay:
Write to this attribute is a NOP, so we don't need q->sysfs_lock.

3. io_timeout:
Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.

4. nomerges:
Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.

5. nr_requests:
Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

6. read_ahead_kb:
Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.

7. rq_affinity:
Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.

8. scheduler:
Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.

However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.

9. wbt_lat_usec:
Write to this attribute file updates the various wbt limits and state. This may race with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock, so yes, we need to protect this attribute with q->sysfs_lock.

However, as mentioned above, if we're thinking of protecting elevator change/update using q->tag_set->tag_list_lock, then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock while reading/writing to this attribute file.

So yes, you've rightly guessed: a few of the above attributes are not well protected, and a few still may require sysfs_lock protection. From the above list, I see that "read_ahead_kb" should be protected with q->limits_lock. "nr_requests" should be protected with q->tag_set->tag_list_lock instead of q->sysfs_lock. "scheduler" and "wbt_lat_usec" require protection using either q->sysfs_lock or q->tag_set->tag_list_lock. The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and they are well protected by the sysfs/kernfs internal lock.

Thanks,
--Nilay
On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find them below:
>
> 1. io_poll:
> Write to this attribute is ignored. So we don't need q->sysfs_lock.
>
> 2. io_poll_delay:
> Write to this attribute is a NOP, so we don't need q->sysfs_lock.

Yes, those are easy.

> 3. io_timeout:
> Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.

Are we sure blk_queue_rq_timeout is never called from anything but probe? Either way, given that this is a limit that isn't directly correlated with anything else, simply using WRITE_ONCE/READ_ONCE might be enough.

> 4. nomerges:
> Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.

Yes.

> 5. nr_requests:
> Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

Yeah.

> 6. read_ahead_kb:
> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.

Yes, limits_lock sounds sensible here.

> 7. rq_affinity:
> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.

Agreed. Although updating both flags isn't atomic, that should be harmless in this case (but could use a comment about that).

> 8. scheduler:
> Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.
>
> However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.

Yes.

> 9. wbt_lat_usec:
> Write to this attribute file updates the various wbt limits and state. This may race with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock, so yes, we need to protect this attribute with q->sysfs_lock.
>
> However, as mentioned above, if we're thinking of protecting elevator change/update using q->tag_set->tag_list_lock, then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock while reading/writing to this attribute file.

Yes.

> The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and they are well protected by the sysfs/kernfs internal lock.

So I think the first step here is to remove the locking from queue_attr_store/queue_attr_show and move it into the attributes that still need it in some form, followed by replacing it with the more suitable locks as defined above. And maybe with a little bit more work we can remove sysfs_lock entirely eventually.
On 2/6/25 7:37 PM, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>> Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find them below:
>>
>> 1. io_poll:
>> Write to this attribute is ignored. So we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>> Write to this attribute is a NOP, so we don't need q->sysfs_lock.
>
> Yes, those are easy.
>
>> 3. io_timeout:
>> Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.
>
> Are we sure blk_queue_rq_timeout is never called from anything but probe? Either way, given that this is a limit that isn't directly correlated with anything else, simply using WRITE_ONCE/READ_ONCE might be enough.

I grepped the source code again and confirmed that blk_queue_rq_timeout() is mostly called from driver probe methods (barring the nbd driver, which may set q->rq_timeout using an ioctl). And yes, agreed: q->rq_timeout can be accessed using WRITE_ONCE/READ_ONCE.

>> 4. nomerges:
>> Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.
>
> Yes.
>
>> 5. nr_requests:
>> Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>
> Yeah.
>
>> 6. read_ahead_kb:
>> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.
>
> Yes, limits_lock sounds sensible here.
>
>> 7. rq_affinity:
>> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.
>
> Agreed. Although updating both flags isn't atomic, that should be harmless in this case (but could use a comment about that).

Sure, I will add a comment about this.

>> 8. scheduler:
>> Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.
>>
>> However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.
>
> Yes.
>
>> 9. wbt_lat_usec:
>> Write to this attribute file updates the various wbt limits and state. This may race with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock, so yes, we need to protect this attribute with q->sysfs_lock.
>>
>> However, as mentioned above, if we're thinking of protecting elevator change/update using q->tag_set->tag_list_lock, then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock while reading/writing to this attribute file.
>
> Yes.
>
>> The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and they are well protected by the sysfs/kernfs internal lock.
>
> So I think the first step here is to remove the locking from queue_attr_store/queue_attr_show and move it into the attributes that still need it in some form, followed by replacing it with the more suitable locks as defined above. And maybe with a little bit more work we can remove sysfs_lock entirely eventually.

Yes, this sounds like a good plan! Thank you, Christoph, for your review! And as you suggested in another thread, I'll wait some more time for Jens and Ming, in case they have any further comments about this change.

Thanks,
--Nilay
On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>> The sysfs attributes are already protected with sysfs/kernfs internal locking. So acquiring q->sysfs_lock is not needed while accessing sysfs attribute files. So this change helps avoid holding q->sysfs_lock while accessing sysfs attribute files.
>>
>> The sysfs/kernfs locking only protects against other accesses using sysfs. But that's not really the most interesting part here. We also want to make sure nothing changes underneath in a way that could cause crashes (and maybe even torn information).
>>
>> We'll really need to audit what is accessed in each method and figure out what protects it. Chances are that sysfs_lock provides that protection in some cases right now, and chances are also very high that a lot of this is pretty broken.
>
> Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find below:
>
> 1. io_poll:
> Write to this attribute is ignored. So we don't need q->sysfs_lock.
>
> 2. io_poll_delay:
> Write to this attribute is a NOP, so we don't need q->sysfs_lock.
>
> 3. io_timeout:
> Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.
>
> 4. nomerges:
> Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.
>
> 5. nr_requests:
> Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>
> 6. read_ahead_kb:
> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.
>
> 7. rq_affinity:
> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.
>
> 8. scheduler:
> Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.
>
> However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.

This is one misuse of tag_list_lock, which is supposed to cover host-wide changes, and shouldn't be used for request-queue-level protection, which is exactly what q->sysfs_lock provides. Not to mention, it will cause an ABBA deadlock over the freeze lock; please see blk_mq_update_nr_hw_queues(). And it can't be used for protecting 'nr_requests' either.

> 9. wbt_lat_usec:
> Write to this attribute file updates the various wbt limits and state. This may race with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock, so yes, we need to protect this attribute with q->sysfs_lock.
>
> However, as mentioned above, if we're thinking of protecting elevator change/update using q->tag_set->tag_list_lock, then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock while reading/writing to this attribute file.

Same ABBA deadlock as above.

Thanks,
Ming
On 2/8/25 4:11 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>> The sysfs attributes are already protected with sysfs/kernfs internal locking. So acquiring q->sysfs_lock is not needed while accessing sysfs attribute files. So this change helps avoid holding q->sysfs_lock while accessing sysfs attribute files.
>>>
>>> The sysfs/kernfs locking only protects against other accesses using sysfs. But that's not really the most interesting part here. We also want to make sure nothing changes underneath in a way that could cause crashes (and maybe even torn information).
>>>
>>> We'll really need to audit what is accessed in each method and figure out what protects it. Chances are that sysfs_lock provides that protection in some cases right now, and chances are also very high that a lot of this is pretty broken.
>>
>> Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find below:
>>
>> 1. io_poll:
>> Write to this attribute is ignored. So we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>> Write to this attribute is a NOP, so we don't need q->sysfs_lock.
>>
>> 3. io_timeout:
>> Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.
>>
>> 4. nomerges:
>> Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.
>>
>> 5. nr_requests:
>> Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>
>> 6. read_ahead_kb:
>> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.
>>
>> 7. rq_affinity:
>> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.
>>
>> 8. scheduler:
>> Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.
>>
>> However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.
>
> This is one misuse of tag_list_lock, which is supposed to cover host-wide changes, and shouldn't be used for request-queue-level protection, which is exactly what q->sysfs_lock provides.
>
Yes, I think Christoph also pointed out the same, but then, assuming a scheduler/elevator update would be a rare operation, it may not cause a lot of contention. Having said that, I'm also fine with creating another lock just to protect elevator changes and removing ->sysfs_lock from the elevator code.

> Not to mention, it will cause an ABBA deadlock over the freeze lock; please see blk_mq_update_nr_hw_queues(). And it can't be used for protecting 'nr_requests' either.

I don't see how this might cause an ABBA deadlock. The proposal here is to use ->tag_list_lock (instead of ->sysfs_lock) while updating the scheduler attribute from sysfs as well as while we update the elevator through __blk_mq_update_nr_hw_queues().

In each code path (either from a sysfs attribute update or from an nr_hw_queues update), we first acquire ->tag_list_lock and then the freeze-lock.

Do you see any code path where the above order might not be followed?

Thanks,
--Nilay
On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
> On 2/8/25 4:11 PM, Ming Lei wrote:
>> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>>> The sysfs attributes are already protected with sysfs/kernfs internal locking. So acquiring q->sysfs_lock is not needed while accessing sysfs attribute files. So this change helps avoid holding q->sysfs_lock while accessing sysfs attribute files.
>>>>
>>>> The sysfs/kernfs locking only protects against other accesses using sysfs. But that's not really the most interesting part here. We also want to make sure nothing changes underneath in a way that could cause crashes (and maybe even torn information).
>>>>
>>>> We'll really need to audit what is accessed in each method and figure out what protects it. Chances are that sysfs_lock provides that protection in some cases right now, and chances are also very high that a lot of this is pretty broken.
>>>
>>> Yes, that's possible, so I audited all the sysfs attributes which are currently protected using q->sysfs_lock, and I found some interesting facts. Please find below:
>>>
>>> 1. io_poll:
>>> Write to this attribute is ignored. So we don't need q->sysfs_lock.
>>>
>>> 2. io_poll_delay:
>>> Write to this attribute is a NOP, so we don't need q->sysfs_lock.
>>>
>>> 3. io_timeout:
>>> Write to this attribute updates q->rq_timeout, and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So that means we may not need to protect it with q->sysfs_lock.
>>>
>>> 4. nomerges:
>>> Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So, I believe, protecting it with q->sysfs_lock is not necessary.
>>>
>>> 5. nr_requests:
>>> Write to this attribute updates the tag sets, and this could potentially race with __blk_mq_update_nr_hw_queues(). So I think we should really protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>>
>>> 6. read_ahead_kb:
>>> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->ra_pages is also updated under queue_limits_commit_update(), which runs holding q->limits_lock; so I think this attribute file should be protected with q->limits_lock, and protecting it with q->sysfs_lock is not necessary. Maybe we should move it under the same set of attribute files which today run with q->limits_lock held.
>>>
>>> 7. rq_affinity:
>>> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.
>>>
>>> 8. scheduler:
>>> Write to this attribute actually updates q->elevator, and the elevator change/switch code expects that q->sysfs_lock is held while we update the iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues() update. So yes, this field needs q->sysfs_lock protection.
>>>
>>> However, if we're thinking of protecting sched change/update using q->tag_set->tag_list_lock (as discussed in another thread), then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute file.
>>
>> This is one misuse of tag_list_lock, which is supposed to cover host-wide changes, and shouldn't be used for request-queue-level protection, which is exactly what q->sysfs_lock provides.
>>
> Yes, I think Christoph also pointed out the same, but then, assuming a scheduler/elevator update would be a rare operation, it may not cause a lot of contention. Having said that, I'm also fine with creating another lock just to protect elevator changes and removing ->sysfs_lock from the elevator code.

Then please use a new lock.

>> Not to mention, it will cause an ABBA deadlock over the freeze lock; please see blk_mq_update_nr_hw_queues(). And it can't be used for protecting 'nr_requests' either.
>
> I don't see how this might cause an ABBA deadlock. The proposal here is to use ->tag_list_lock (instead of ->sysfs_lock) while updating the scheduler attribute from sysfs as well as while we update the elevator through __blk_mq_update_nr_hw_queues().
>
> In each code path (either from a sysfs attribute update or from an nr_hw_queues update), we first acquire ->tag_list_lock and then the freeze-lock.
>
> Do you see any code path where the above order might not be followed?

Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk addition/removal") added one such warning: blk_mq_sysfs_unregister() is called after the queue freeze lock is grabbed from del_gendisk().

Also, there are many such use cases in nvme: blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() are called after the tagset is frozen.

More seriously, a driver may grab ->tag_list_lock in its error recovery code to provide forward progress, so you have to be careful wrt. using ->tag_list_lock, for example:

    mutex_lock(->tag_list_lock)
    blk_mq_freeze_queue()    // If an IO timeout happens, the driver timeout
                             // handler gets stuck on mutex_lock(->tag_list_lock)

Thanks,
Ming
On 2/9/25 5:11 PM, Ming Lei wrote: > On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote: >> >> >> On 2/8/25 4:11 PM, Ming Lei wrote: >>> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote: >>>> >>>> >>>> On 2/5/25 9:23 PM, Christoph Hellwig wrote: >>>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote: >>>>>> The sysfs attributes are already protected with sysfs/kernfs internal >>>>>> locking. So acquiring q->sysfs_lock is not needed while accessing sysfs >>>>>> attribute files. So this change helps avoid holding q->sysfs_lock while >>>>>> accessing sysfs attribute files. >>>>> >>>>> the sysfs/kernfs locking only protects against other accesses using >>>>> sysfs. But that's not really the most interesting part here. We >>>>> also want to make sure nothing changes underneath in a way that >>>>> could cause crashes (and maybe even torn information). >>>>> >>>>> We'll really need to audit what is accessed in each method and figure >>>>> out what protects it. Chances are that sysfs_lock provides that >>>>> protection in some case right now, and chances are also very high >>>>> that a lot of this is pretty broken. >>>>> >>>> Yes that's possible and so I audited all sysfs attributes which are >>>> currently protected using q->sysfs_lock and I found some interesting >>>> facts. Please find below: >>>> >>>> 1. io_poll: >>>> Write to this attribute is ignored. So, we don't need q->sysfs_lock. >>>> >>>> 2. io_poll_delay: >>>> Write to this attribute is NOP, so we don't need q->sysfs_lock. >>>> >>>> 3. io_timeout: >>>> Write to this attribute updates q->rq_timeout and read of this attribute >>>> returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is >>>> set only once when we init the queue (under blk_mq_init_allocated_queue()) >>>> even before disk is added. So that means that we may not need to protect >>>> it with q->sysfs_lock. >>>> >>>> 4. 
nomerges: >>>> Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES >>>> and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which >>>> anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are >>>> updated/accessed with bitops which are atomic. So, I believe, protecting >>>> it with q->sysfs_lock is not necessary. >>>> >>>> 5. nr_requests: >>>> Write to this attribute updates the tag sets and this could potentially >>>> race with __blk_mq_update_nr_hw_queues(). So I think we should really >>>> protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock. >>>> >>>> 6. read_ahead_kb: >>>> Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi-> >>>> ra_pages is also updated under queue_limits_commit_update() which runs >>>> holding q->limits_lock; so I think this attribute file should be protected >>>> with q->limits_lock and protecting it with q->sysfs_lock is not necessary. >>>> Maybe we should move it under the same sets of attribute files which today >>>> runs with q->limits_lock held. >>>> >>>> 7. rq_affinity: >>>> Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP >>>> and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() >>>> using test_bit macro. As read/write to q->flags uses bitops which are atomic, >>>> protecting it with q->stsys_lock is not necessary. >>>> >>>> 8. scheduler: >>>> Write to this attribute actually updates q->elevator and the elevator change/switch >>>> code expects that the q->sysfs_lock is held while we update the iosched to protect >>>> against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs >>>> q->sysfs_lock protection. 
>>>>
>>>>    However, if we're thinking of protecting the sched change/update
>>>>    using q->tag_set->tag_list_lock (as discussed in another thread),
>>>>    then we may use q->tag_set->tag_list_lock instead of q->sysfs_lock
>>>>    while reading/writing to this attribute file.
>>>
>>> This is one misuse of tag_list_lock, which is supposed to cover
>>> host-wide changes, and shouldn't be used for request-queue-level
>>> protection, which is exactly what q->sysfs_lock provides.
>>>
>> Yes, I think Christoph also pointed out the same, but then, assuming a
>> scheduler/elevator update would be a rare operation, it may not cause a
>> lot of contention. Having said that, I'm also fine creating another lock
>> just to protect elevator changes and removing ->sysfs_lock from the
>> elevator code.
>
> Then please use a new lock.

Okay, I will replace q->sysfs_lock with another dedicated lock for
synchronizing elevator switch and nr_hw_queue updates, and that would help
eliminate the dependency between q->q_usage_counter(io) (or the freeze
lock) and q->sysfs_lock.

>
>>
>>> Not to mention it will cause an ABBA deadlock over the freeze lock;
>>> please see blk_mq_update_nr_hw_queues(). And it can't be used for
>>> protecting 'nr_requests' either.
>>
>> I don't know how this might cause an ABBA deadlock. The proposal here is
>> to use ->tag_list_lock (instead of ->sysfs_lock) while updating the
>> scheduler attribute from sysfs as well as while we update the elevator
>> through __blk_mq_update_nr_hw_queues().
>>
>> In each code path (either from a sysfs attribute update or from an
>> nr_hw_queues update), we first acquire ->tag_list_lock and then the
>> freeze lock.
>>
>> Do you see any code path where the above order might not be followed?
>
> Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk
> addition/removal") has added one such warning: blk_mq_sysfs_unregister()
> is called after the queue freeze lock is grabbed from del_gendisk().
>
> Also there are many such use cases in nvme:
> blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() are called after the
> tagset is frozen.
>
> More seriously, a driver may grab ->tag_list_lock in error-recovery code
> to provide forward progress, so you have to be careful wrt. using
> ->tag_list_lock, for example:
>
> mutex_lock(->tag_list_lock)
> blk_mq_freeze_queue() // If an IO timeout happens, the driver timeout
>                       // handler gets stuck on mutex_lock(->tag_list_lock)

Ok, got it! But let's wait a bit in case Christoph or others have any
further comments before I start making this change.

Thanks,
--Nilay
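The ordering rule being debated above can be sketched in user space with pthreads: as long as every path that takes both locks acquires the outer lock (the stand-in for ->tag_list_lock here) before the freeze lock, the ABBA interleaving Ming describes cannot occur. This is only an illustration with hypothetical names, not block-layer code:

```c
#include <pthread.h>

/* Hypothetical stand-ins for ->tag_list_lock and the queue freeze lock. */
static pthread_mutex_t tag_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t freeze_lock   = PTHREAD_MUTEX_INITIALIZER;
static int updates_done;

/* Both paths honor one global order: tag_list_lock (A) first, then
 * freeze_lock (B).  If any path instead froze the queue first and then
 * reached for tag_list_lock -- as a driver timeout handler might while a
 * queue is frozen -- two threads could each hold one lock while waiting
 * forever on the other: the ABBA deadlock. */
static void sysfs_scheduler_store(void)
{
	pthread_mutex_lock(&tag_list_lock);   /* A */
	pthread_mutex_lock(&freeze_lock);     /* B */
	updates_done++;                       /* ... switch the elevator ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&tag_list_lock);
}

static void update_nr_hw_queues(void)
{
	pthread_mutex_lock(&tag_list_lock);   /* A */
	pthread_mutex_lock(&freeze_lock);     /* B */
	updates_done++;                       /* ... resize hardware queues ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&tag_list_lock);
}
```

Ming's point is that the block layer cannot guarantee this single order everywhere, since some callers (del_gendisk(), nvme tagset quiesce, timeout handlers) already run after the freeze lock is taken, which is what makes a narrower dedicated lock the safer choice.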
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3feeeccf8a99..da53397d99fa 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -52,7 +52,6 @@ static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	struct blk_mq_hw_ctx_sysfs_entry *entry;
 	struct blk_mq_hw_ctx *hctx;
 	struct request_queue *q;
-	ssize_t res;
 
 	entry = container_of(attr, struct blk_mq_hw_ctx_sysfs_entry, attr);
 	hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);
@@ -61,10 +60,7 @@ static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	if (!entry->show)
 		return -EIO;
 
-	mutex_lock(&q->sysfs_lock);
-	res = entry->show(hctx, page);
-	mutex_unlock(&q->sysfs_lock);
-	return res;
+	return entry->show(hctx, page);
 }
 
 static ssize_t blk_mq_hw_sysfs_nr_tags_show(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f548a4376aa..2b8e7b311c61 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -664,14 +664,11 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 {
 	struct queue_sysfs_entry *entry = to_queue(attr);
 	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
-	ssize_t res;
 
 	if (!entry->show)
 		return -EIO;
-	mutex_lock(&disk->queue->sysfs_lock);
-	res = entry->show(disk, page);
-	mutex_unlock(&disk->queue->sysfs_lock);
-	return res;
+
+	return entry->show(disk, page);
 }
 
 static ssize_t
@@ -710,11 +707,10 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 		return length;
 	}
 
-	mutex_lock(&q->sysfs_lock);
 	memflags = blk_mq_freeze_queue(q);
 	res = entry->store(disk, page, length);
 	blk_mq_unfreeze_queue(q, memflags);
-	mutex_unlock(&q->sysfs_lock);
+
 	return res;
 }
The sysfs attributes are already protected with sysfs/kernfs internal
locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
attribute files. This change therefore avoids holding q->sysfs_lock while
accessing sysfs attribute files.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sysfs.c |  6 +-----
 block/blk-sysfs.c    | 10 +++-------
 2 files changed, 4 insertions(+), 12 deletions(-)