diff mbox series

[-next,v4] blk-mq: fix tag_get wait task can't be awakened

Message ID 20220111140216.1858823-1-qiulaibin@huawei.com (mailing list archive)
State New, archived
Headers show
Series [-next,v4] blk-mq: fix tag_get wait task can't be awakened | expand

Commit Message

QiuLaibin Jan. 11, 2022, 2:02 p.m. UTC
In case of shared tags, there might be more than one hctx which
allocates from the same tags, and each hctx is limited to allocate at
most:
        hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);

tag idle detection is lazy, and may be delayed for 30sec, so there
could be just one real active hctx(queue) but all others are actually
idle and still accounted as active because of the lazy idle detection.
Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
forever on this real active hctx.

Fix this by recalculating wake_batch when inc or dec active_queues.

Fixes: 0d2602ca30e41 ("blk-mq: improve support for shared tags maps")
Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
---
 block/blk-mq-tag.c      | 39 ++++++++++++++++++++++++++++++++-------
 include/linux/sbitmap.h | 11 +++++++++++
 lib/sbitmap.c           | 22 +++++++++++++++++++---
 3 files changed, 62 insertions(+), 10 deletions(-)

Comments

Andy Shevchenko Jan. 11, 2022, 2:15 p.m. UTC | #1
On Tue, Jan 11, 2022 at 10:02:16PM +0800, Laibin Qiu wrote:
> In case of shared tags, there might be more than one hctx which
> allocates from the same tags, and each hctx is limited to allocate at
> most:
>         hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);
> 
> tag idle detection is lazy, and may be delayed for 30sec, so there
> could be just one real active hctx(queue) but all others are actually
> idle and still accounted as active because of the lazy idle detection.
> Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
> forever on this real active hctx.
> 
> Fix this by recalculating wake_batch when inc or dec active_queues.

...

>  {
> +	unsigned int users;

Missed blank line here.

>  	if (blk_mq_is_shared_tags(hctx->flags)) {
>  		struct request_queue *q = hctx->queue;
>  
> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {

Whoever wrote this code did too much defensive programming, because the first
conditional doesn't make much sense here. Am I right?

> +			return true;
> +		}
>  	} else {

> +		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
> +		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {

Ditto.

> +			return true;
> +		}
>  	}

...

> +	unsigned int wake_batch = clamp_t(unsigned int,
> +			(sbq->sb.depth + users - 1) / users, 4U, SBQ_WAKE_BATCH);


	unsigned int wake_batch;

	wake_batch = clamp_val((sbq->sb.depth + users - 1) / users, 4, SBQ_WAKE_BATCH);
	...

is easier to read, no?
QiuLaibin Jan. 12, 2022, 4:18 a.m. UTC | #2
On 2022/1/11 22:15, Andy Shevchenko wrote:
> On Tue, Jan 11, 2022 at 10:02:16PM +0800, Laibin Qiu wrote:
>> In case of shared tags, there might be more than one hctx which
>> allocates from the same tags, and each hctx is limited to allocate at
>> most:
>>          hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);
>>
>> tag idle detection is lazy, and may be delayed for 30sec, so there
>> could be just one real active hctx(queue) but all others are actually
>> idle and still accounted as active because of the lazy idle detection.
>> Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
>> forever on this real active hctx.
>>
>> Fix this by recalculating wake_batch when inc or dec active_queues.
> 
> ...
> 
>>   {
>> +	unsigned int users;
> 
> Missed blank line here.
Thanks, i will modify it in V5.
> 
>>   	if (blk_mq_is_shared_tags(hctx->flags)) {
>>   		struct request_queue *q = hctx->queue;
>>   
>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> 
> Whoever wrote this code did too much defensive programming, because the first
> conditional doesn't make much sense here. Am I right?
> 
I think because this judgement is in the general IO process, there are 
also some performance considerations here.
>> +			return true;
>> +		}
>>   	} else {
> 
>> +		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
>> +		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
> 
> Ditto.
> 
>> +			return true;
>> +		}
>>   	}
> 
> ...
> 
>> +	unsigned int wake_batch = clamp_t(unsigned int,
>> +			(sbq->sb.depth + users - 1) / users, 4U, SBQ_WAKE_BATCH);
> 
> 
> 	unsigned int wake_batch;
> 
> 	wake_batch = clamp_val((sbq->sb.depth + users - 1) / users, 4, SBQ_WAKE_BATCH);
> 	...
> 
> is easier to read, no?

Here I refer to the calculation method in sbq_calc_wake_batch(). And I 
will separate the definition from the calculation in V5.

>
Andy Shevchenko Jan. 12, 2022, 12:30 p.m. UTC | #3
On Wed, Jan 12, 2022 at 12:18:53PM +0800, QiuLaibin wrote:
> On 2022/1/11 22:15, Andy Shevchenko wrote:
> > On Tue, Jan 11, 2022 at 10:02:16PM +0800, Laibin Qiu wrote:

...

> > > +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> > > +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> > 
> > Whoever wrote this code did too much defensive programming, because the first
> > conditional doesn't make much sense here. Am I right?
> > 
> I think because this judgement is in the general IO process, there are also
> some performance considerations here.

I didn't buy this. Is there any better argument why you need redundant
test_bit() call?

> > > +			return true;

> > >   	} else {
> > 
> > > +		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
> > > +		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
> > 
> > Ditto.
> > 
> > > +			return true;

> > >   	}

...

> > > +	unsigned int wake_batch = clamp_t(unsigned int,
> > > +			(sbq->sb.depth + users - 1) / users, 4U, SBQ_WAKE_BATCH);
> > 
> > 
> > 	unsigned int wake_batch;
> > 
> > 	wake_batch = clamp_val((sbq->sb.depth + users - 1) / users, 4, SBQ_WAKE_BATCH);
> > 	...
> > 
> > is easier to read, no?
> 
> Here I refer to the calculation method in sbq_calc_wake_batch(). And I will
> separate the definition from the calculation in V5.

I'm not sure I understand how it's related to the style changes I proposed.
I haven't changed any logic behind.
John Garry Jan. 12, 2022, 12:51 p.m. UTC | #4
On 12/01/2022 12:30, Andy Shevchenko wrote:
>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>> Whoever wrote this code did too much defensive programming, because the first
>>> conditional doesn't make much sense here. Am I right?
>>>
>> I think because this judgement is in the general IO process, there are also
>> some performance considerations here.
> I didn't buy this. Is there any better argument why you need redundant
> test_bit() call?
> 

I think that the idea is that test_bit() is fast and test_and_set_bit() 
is slow; as such, if we generally expect the bit to be set, then there 
is no need to do the slower test_and_set_bit() always.
Andy Shevchenko Jan. 12, 2022, 2:38 p.m. UTC | #5
On Wed, Jan 12, 2022 at 12:51:13PM +0000, John Garry wrote:
> On 12/01/2022 12:30, Andy Shevchenko wrote:
> > > > > +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> > > > > +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> > > > Whoever wrote this code did too much defensive programming, because the first
> > > > conditional doesn't make much sense here. Am I right?
> > > > 
> > > I think because this judgement is in the general IO process, there are also
> > > some performance considerations here.
> > I didn't buy this. Is there any better argument why you need redundant
> > test_bit() call?
> 
> I think that the idea is that test_bit() is fast and test_and_set_bit() is
> slow; as such, if we generally expect the bit to be set, then there is no
> need to do the slower test_and_set_bit() always.

It doesn't sound thought through solution, the bit can be flipped in between,
so what is this all about? Maybe missing proper serialization somewhere else?
Jens Axboe Jan. 12, 2022, 3:37 p.m. UTC | #6
On 1/12/22 7:38 AM, Andy Shevchenko wrote:
> On Wed, Jan 12, 2022 at 12:51:13PM +0000, John Garry wrote:
>> On 12/01/2022 12:30, Andy Shevchenko wrote:
>>>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>>>> Whoever wrote this code did too much defensive programming, because the first
>>>>> conditional doesn't make much sense here. Am I right?
>>>>>
>>>> I think because this judgement is in the general IO process, there are also
>>>> some performance considerations here.
>>> I didn't buy this. Is there any better argument why you need redundant
>>> test_bit() call?
>>
>> I think that the idea is that test_bit() is fast and test_and_set_bit() is
>> slow; as such, if we generally expect the bit to be set, then there is no
>> need to do the slower test_and_set_bit() always.
> 
> It doesn't sound thought through solution, the bit can be flipped in
> between, so what is this all about? Maybe missing proper serialization
> somewhere else?

You need to work on your communication a bit - if there's a piece of
code you don't understand, maybe try being a bit less aggressive about
it? Otherwise people tend to just ignore you rather than explain it.

test_bit() is a lot faster than a test_and_set_bit(), and there's no
need to run the latter if the former returns true. This is a pretty
common optimization, particularly if the majority of the calls end up
having the bit set already.

Can the bit be flipped right after? Certainly! Can that happen if just
test_and_set_bit() is used? Of course! This isn't a critical section
with a lock, it's a piece of state. And guarding the RMW operation with
a test first doesn't change that one bit.
Andy Shevchenko Jan. 12, 2022, 4:29 p.m. UTC | #7
On Wed, Jan 12, 2022 at 08:37:34AM -0700, Jens Axboe wrote:
> On 1/12/22 7:38 AM, Andy Shevchenko wrote:
> > On Wed, Jan 12, 2022 at 12:51:13PM +0000, John Garry wrote:
> >> On 12/01/2022 12:30, Andy Shevchenko wrote:
> >>>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
> >>>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
> >>>>> Whoever wrote this code did too much defensive programming, because the first
> >>>>> conditional doesn't make much sense here. Am I right?
> >>>>>
> >>>> I think because this judgement is in the general IO process, there are also
> >>>> some performance considerations here.
> >>> I didn't buy this. Is there any better argument why you need redundant
> >>> test_bit() call?
> >>
> >> I think that the idea is that test_bit() is fast and test_and_set_bit() is
> >> slow; as such, if we generally expect the bit to be set, then there is no
> >> need to do the slower test_and_set_bit() always.
> > 
> > It doesn't sound thought through solution, the bit can be flipped in
> > between, so what is this all about? Maybe missing proper serialization
> > somewhere else?
> 
> You need to work on your communication a bit - if there's a piece of
> code you don't understand, maybe try being a bit less aggressive about
> it? Otherwise people tend to just ignore you rather than explain it.

Sure. Thanks for below explanations, btw.

> test_bit() is a lot faster than a test_and_set_bit(), and there's no
> need to run the latter if the former returns true. This is a pretty
> common optimization, particularly if the majority of the calls end up
> having the bit set already.

I don't see how it may give optimization here generally speaking.
If we know that bit is _often_ is set, than of course it sounds
like opportunistic win. Otherwise, it may have the opposite effect.

I.o.w. without knowing the statistics of the bit being set/clear it's
hard to say if it's optimization or waste of the (CPU) resources.

> Can the bit be flipped right after? Certainly! Can that happen if just
> test_and_set_bit() is used? Of course! This isn't a critical section
> with a lock, it's a piece of state. And guarding the RMW operation with
> a test first doesn't change that one bit.
Jens Axboe Jan. 12, 2022, 4:38 p.m. UTC | #8
On 1/12/22 9:29 AM, Andy Shevchenko wrote:
> On Wed, Jan 12, 2022 at 08:37:34AM -0700, Jens Axboe wrote:
>> On 1/12/22 7:38 AM, Andy Shevchenko wrote:
>>> On Wed, Jan 12, 2022 at 12:51:13PM +0000, John Garry wrote:
>>>> On 12/01/2022 12:30, Andy Shevchenko wrote:
>>>>>>>> +		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
>>>>>>>> +		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
>>>>>>> Whoever wrote this code did too much defensive programming, because the first
>>>>>>> conditional doesn't make much sense here. Am I right?
>>>>>>>
>>>>>> I think because this judgement is in the general IO process, there are also
>>>>>> some performance considerations here.
>>>>> I didn't buy this. Is there any better argument why you need redundant
>>>>> test_bit() call?
>>>>
>>>> I think that the idea is that test_bit() is fast and test_and_set_bit() is
>>>> slow; as such, if we generally expect the bit to be set, then there is no
>>>> need to do the slower test_and_set_bit() always.
>>>
>>> It doesn't sound thought through solution, the bit can be flipped in
>>> between, so what is this all about? Maybe missing proper serialization
>>> somewhere else?
>>
>> You need to work on your communication a bit - if there's a piece of
>> code you don't understand, maybe try being a bit less aggressive about
>> it? Otherwise people tend to just ignore you rather than explain it.
> 
> Sure. Thanks for below explanations, btw.
> 
>> test_bit() is a lot faster than a test_and_set_bit(), and there's no
>> need to run the latter if the former returns true. This is a pretty
>> common optimization, particularly if the majority of the calls end up
>> having the bit set already.
> 
> I don't see how it may give optimization here generally speaking.
> If we know that bit is _often_ is set, than of course it sounds
> like opportunistic win. Otherwise, it may have the opposite effect.
> 
> I.o.w. without knowing the statistics of the bit being set/clear it's
> hard to say if it's optimization or waste of the (CPU) resources.

It doesn't really matter that much, the test_bit() is essentially free
compared to the test and set. Which means that in practice there's
little downside, and the upsides on when the bit is set is pretty big.

This technique has proven very useful in other spots, only downside is
really that there's no general helper to do this. That would also help
the readability.
diff mbox series

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e55a6834c9a6..48b00d56141a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -16,6 +16,21 @@ 
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
 
+/*
+ * Recalculate wakeup batch when tag is shared by hctx.
+ */
+static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
+		unsigned int users)
+{
+	if (!users)
+		return;
+
+	sbitmap_queue_recalculate_wake_batch(&tags->bitmap_tags,
+			users);
+	sbitmap_queue_recalculate_wake_batch(&tags->breserved_tags,
+			users);
+}
+
 /*
  * If a previously inactive queue goes active, bump the active user count.
  * We need to do this before try to allocate driver tag, then even if fail
@@ -24,18 +39,25 @@ 
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
+	unsigned int users;
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
 
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
-		    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			atomic_inc(&hctx->tags->active_queues);
+		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
+		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
+			return true;
+		}
 	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			atomic_inc(&hctx->tags->active_queues);
+		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
+		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
+			return true;
+		}
 	}
 
+	users = atomic_inc_return(&hctx->tags->active_queues);
+
+	blk_mq_update_wake_batch(hctx->tags, users);
+
 	return true;
 }
 
@@ -56,6 +78,7 @@  void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_tags *tags = hctx->tags;
+	unsigned int users;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
@@ -68,7 +91,9 @@  void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 			return;
 	}
 
-	atomic_dec(&tags->active_queues);
+	users = atomic_dec_return(&tags->active_queues);
+
+	blk_mq_update_wake_batch(tags, users);
 
 	blk_mq_tag_wakeup_all(tags, false);
 }
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index fc0357a6e19b..95df357ec009 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -415,6 +415,17 @@  static inline void sbitmap_queue_free(struct sbitmap_queue *sbq)
 	sbitmap_free(&sbq->sb);
 }
 
+/**
+ * sbitmap_queue_recalculate_wake_batch() - Recalculate wake batch
+ * @sbq: Bitmap queue to recalculate wake batch.
+ * @users: Number of shares.
+ *
+ * Like sbitmap_queue_update_wake_batch(), this will calculate wake batch
+ * by depth. This interface is for HCTX shared tags or queue shared tags.
+ */
+void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
+					    unsigned int users);
+
 /**
  * sbitmap_queue_resize() - Resize a &struct sbitmap_queue.
  * @sbq: Bitmap queue to resize.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2709ab825499..94b3272effd8 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -457,10 +457,9 @@  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
-static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
-					    unsigned int depth)
+static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
+					    unsigned int wake_batch)
 {
-	unsigned int wake_batch = sbq_calc_wake_batch(sbq, depth);
 	int i;
 
 	if (sbq->wake_batch != wake_batch) {
@@ -476,6 +475,23 @@  static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
 	}
 }
 
+static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
+					    unsigned int depth)
+{
+	unsigned int wake_batch = sbq_calc_wake_batch(sbq, depth);
+
+	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
+}
+
+void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
+					    unsigned int users)
+{
+	unsigned int wake_batch = clamp_t(unsigned int,
+			(sbq->sb.depth + users - 1) / users, 4U, SBQ_WAKE_BATCH);
+	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
+}
+EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
+
 void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
 {
 	sbitmap_queue_update_wake_batch(sbq, depth);