diff mbox series

[1/1] blk-mq: fix race condition in active queue accounting

Message ID 20230522004328.760024-1-tilan7663@gmail.com (mailing list archive)
State New, archived
Headers show
Series [1/1] blk-mq: fix race condition in active queue accounting | expand

Commit Message

Tian Lan May 22, 2023, 12:43 a.m. UTC
From: Tian Lan <tian.lan@twosigma.com>

If multiple CPUs are sharing the same hardware queue, it can
cause leak in the active queue counter tracking when __blk_mq_tag_busy()
is executed simultaneously.

Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
---
 block/blk-mq-tag.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Ming Lei May 22, 2023, 1:08 a.m. UTC | #1
On Mon, May 22, 2023 at 8:45 AM Tian Lan <tilan7663@gmail.com> wrote:
>
> From: Tian Lan <tian.lan@twosigma.com>
>
> If multiple CPUs are sharing the same hardware queue, it can
> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
> is executed simultaneously.
>
> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
> Signed-off-by: Tian Lan <tian.lan@twosigma.com>

Looks fine,

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Damien Le Moal May 22, 2023, 1:15 a.m. UTC | #2
On 5/22/23 09:43, Tian Lan wrote:
> From: Tian Lan <tian.lan@twosigma.com>
> 
> If multiple CPUs are sharing the same hardware queue, it can
> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
> is executed simultaneously.
> 
> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
> ---
>  block/blk-mq-tag.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d6af9d431dc6..07372032238a 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>  		struct request_queue *q = hctx->queue;
>  
> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {

This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:

		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
			return;

?

>  			return;
> -		set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags);
> +		}
>  	} else {
> -		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
> +		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
>  			return;
> -		set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
> +		}

Same here. And given that this pattern is the same for the if and the else, this
entire hunk can likely be simplified.

>  	}
>  
>  	users = atomic_inc_return(&hctx->tags->active_queues);
Ming Lei May 22, 2023, 1:23 a.m. UTC | #3
On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
> On 5/22/23 09:43, Tian Lan wrote:
> > From: Tian Lan <tian.lan@twosigma.com>
> > 
> > If multiple CPUs are sharing the same hardware queue, it can
> > cause leak in the active queue counter tracking when __blk_mq_tag_busy()
> > is executed simultaneously.
> > 
> > Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
> > Signed-off-by: Tian Lan <tian.lan@twosigma.com>
> > ---
> >  block/blk-mq-tag.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index d6af9d431dc6..07372032238a 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
> >  	if (blk_mq_is_shared_tags(hctx->flags)) {
> >  		struct request_queue *q = hctx->queue;
> >  
> > -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> > +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> > +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> 
> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
> 
> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> 			return;
> 
> ?

It is one micro optimization since test_and_set_bit is much heavier than
test_bit, so test_and_set_bit() is just needed in the 1st time.

Thanks,
Ming
Damien Le Moal May 22, 2023, 1:29 a.m. UTC | #4
On 5/22/23 10:23, Ming Lei wrote:
> On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
>> On 5/22/23 09:43, Tian Lan wrote:
>>> From: Tian Lan <tian.lan@twosigma.com>
>>>
>>> If multiple CPUs are sharing the same hardware queue, it can
>>> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
>>> is executed simultaneously.
>>>
>>> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
>>> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
>>> ---
>>>  block/blk-mq-tag.c | 10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index d6af9d431dc6..07372032238a 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>>>  		struct request_queue *q = hctx->queue;
>>>  
>>> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>
>> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
>>
>> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>> 			return;
>>
>> ?
> 
> It is one micro optimization since test_and_set_bit is much heavier than
> test_bit, so test_and_set_bit() is just needed in the 1st time.

But having the 2 calls test_bit + test_and_set_bit creates a race, doesn't it ?
What if another cpu clears the bit between these 2 calls ?

At the very least, the patch should remove the curly braces around that if.

> 
> Thanks,
> Ming
>
Jens Axboe May 22, 2023, 2:05 a.m. UTC | #5
On 5/21/23 7:23?PM, Ming Lei wrote:
> On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
>> On 5/22/23 09:43, Tian Lan wrote:
>>> From: Tian Lan <tian.lan@twosigma.com>
>>>
>>> If multiple CPUs are sharing the same hardware queue, it can
>>> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
>>> is executed simultaneously.
>>>
>>> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
>>> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
>>> ---
>>>  block/blk-mq-tag.c | 10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index d6af9d431dc6..07372032238a 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>>>  		struct request_queue *q = hctx->queue;
>>>  
>>> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>
>> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
>>
>> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>> 			return;
>>
>> ?
> 
> It is one micro optimization since test_and_set_bit is much heavier
> than test_bit, so test_and_set_bit() is just needed in the 1st time.

It's an optimization, but it's certainly not a micro one. If the common
case is always hitting that, then test_and_set_bit() will repeatedly
dirty that cacheline. And obviously it's useless to do that if the bit
is already set. This makes it a pretty nice optimization and definitely
outside the realm of "micro optimization" as it can have quite a large
impact. I used that in various spots in blk-mq, which I suspect is where
the inspiration for this one came too.
Jens Axboe May 22, 2023, 2:07 a.m. UTC | #6
On 5/21/23 7:29?PM, Damien Le Moal wrote:
> On 5/22/23 10:23, Ming Lei wrote:
>> On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
>>> On 5/22/23 09:43, Tian Lan wrote:
>>>> From: Tian Lan <tian.lan@twosigma.com>
>>>>
>>>> If multiple CPUs are sharing the same hardware queue, it can
>>>> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
>>>> is executed simultaneously.
>>>>
>>>> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
>>>> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
>>>> ---
>>>>  block/blk-mq-tag.c | 10 ++++++----
>>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index d6af9d431dc6..07372032238a 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>>>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>>>>  		struct request_queue *q = hctx->queue;
>>>>  
>>>> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>>
>>> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
>>>
>>> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>> 			return;
>>>
>>> ?
>>
>> It is one micro optimization since test_and_set_bit is much heavier than
>> test_bit, so test_and_set_bit() is just needed in the 1st time.
> 
> But having the 2 calls test_bit + test_and_set_bit creates a race,
> doesn't it ? What if another cpu clears the bit between these 2 calls
> ?

How so? If the bit is already set or you're racing with it being set or
cleared, that race already exists before. This simply prevent
unnecessary dirtying of that cacheline.
Ming Lei May 22, 2023, 2:17 a.m. UTC | #7
On Mon, May 22, 2023 at 10:29:05AM +0900, Damien Le Moal wrote:
> On 5/22/23 10:23, Ming Lei wrote:
> > On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
> >> On 5/22/23 09:43, Tian Lan wrote:
> >>> From: Tian Lan <tian.lan@twosigma.com>
> >>>
> >>> If multiple CPUs are sharing the same hardware queue, it can
> >>> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
> >>> is executed simultaneously.
> >>>
> >>> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
> >>> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
> >>> ---
> >>>  block/blk-mq-tag.c | 10 ++++++----
> >>>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> >>> index d6af9d431dc6..07372032238a 100644
> >>> --- a/block/blk-mq-tag.c
> >>> +++ b/block/blk-mq-tag.c
> >>> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
> >>>  	if (blk_mq_is_shared_tags(hctx->flags)) {
> >>>  		struct request_queue *q = hctx->queue;
> >>>  
> >>> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> >>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> >>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> >>
> >> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
> >>
> >> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> >> 			return;
> >>
> >> ?
> > 
> > It is one micro optimization since test_and_set_bit is much heavier than
> > test_bit, so test_and_set_bit() is just needed in the 1st time.
> 
> But having the 2 calls test_bit + test_and_set_bit creates a race, doesn't it ?
> What if another cpu clears the bit between these 2 calls ?

If test_bit() returns 0, there isn't such race since both sides are atomic OP.

If test_bit() returns 1:

1) __blk_mq_tag_busy() vs. __blk_mq_tag_busy()
- both does nothing

2) __blk_mq_tag_busy() vs. __blk_mq_tag_idle()
- hctx_may_queue() is always following __blk_mq_tag_busy()
- hctx_may_queue() returns true in case that this flag is cleared
- current __blk_mq_tag_busy() does nothing, the following allocation
works fine given hctx_may_queue() returns true
- new __blk_mq_tag_busy() will setup everything as fine


Thanks,
Ming
Damien Le Moal May 22, 2023, 2:26 a.m. UTC | #8
On 5/22/23 11:17, Ming Lei wrote:
> On Mon, May 22, 2023 at 10:29:05AM +0900, Damien Le Moal wrote:
>> On 5/22/23 10:23, Ming Lei wrote:
>>> On Mon, May 22, 2023 at 10:15:22AM +0900, Damien Le Moal wrote:
>>>> On 5/22/23 09:43, Tian Lan wrote:
>>>>> From: Tian Lan <tian.lan@twosigma.com>
>>>>>
>>>>> If multiple CPUs are sharing the same hardware queue, it can
>>>>> cause leak in the active queue counter tracking when __blk_mq_tag_busy()
>>>>> is executed simultaneously.
>>>>>
>>>>> Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
>>>>> Signed-off-by: Tian Lan <tian.lan@twosigma.com>
>>>>> ---
>>>>>  block/blk-mq-tag.c | 10 ++++++----
>>>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>>> index d6af9d431dc6..07372032238a 100644
>>>>> --- a/block/blk-mq-tag.c
>>>>> +++ b/block/blk-mq-tag.c
>>>>> @@ -42,13 +42,15 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>>>>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>>>>>  		struct request_queue *q = hctx->queue;
>>>>>  
>>>>> -		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>>>
>>>> This is weird. test_and_set_bit() returns the bit old value, so shouldn't this be:
>>>>
>>>> 		if (test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>>>> 			return;
>>>>
>>>> ?
>>>
>>> It is one micro optimization since test_and_set_bit is much heavier than
>>> test_bit, so test_and_set_bit() is just needed in the 1st time.
>>
>> But having the 2 calls test_bit + test_and_set_bit creates a race, doesn't it ?
>> What if another cpu clears the bit between these 2 calls ?
> 
> If test_bit() returns 0, there isn't such race since both sides are atomic OP.
> 
> If test_bit() returns 1:
> 
> 1) __blk_mq_tag_busy() vs. __blk_mq_tag_busy()
> - both does nothing
> 
> 2) __blk_mq_tag_busy() vs. __blk_mq_tag_idle()
> - hctx_may_queue() is always following __blk_mq_tag_busy()
> - hctx_may_queue() returns true in case that this flag is cleared
> - current __blk_mq_tag_busy() does nothing, the following allocation
> works fine given hctx_may_queue() returns true
> - new __blk_mq_tag_busy() will setup everything as fine

OK. Thanks. It would be nice to update the function comment (which is not very
clear due to grammar issues) to document this unusual use of the bit functions,
including the fact that it is an optimization to avoid dirtying cache lines.

> 
> 
> Thanks,
> Ming
>
diff mbox series

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d6af9d431dc6..07372032238a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -42,13 +42,15 @@  void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
 
-		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
+		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
 			return;
-		set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags);
+		}
 	} else {
-		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
+		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
 			return;
-		set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
+		}
 	}
 
 	users = atomic_inc_return(&hctx->tags->active_queues);