
block: Improve shared tag set performance

Message ID 20231018180056.2151711-1-bvanassche@acm.org (mailing list archive)
State New, archived
Series: block: Improve shared tag set performance

Commit Message

Bart Van Assche Oct. 18, 2023, 6 p.m. UTC
Remove the code for fair tag sharing because it significantly hurts
performance for UFS devices. Removing this code is safe because the
legacy block layer worked fine without any equivalent fairness
algorithm.

This algorithm hurts performance for UFS devices because UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.

See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/

Note: it has been attempted to rework this algorithm. See also "[PATCH
RFC 0/7] blk-mq: improve tag fair sharing"
(https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
Given the complexity of that patch series, I do not expect that patch
series to be merged.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-tag.c |  4 ----
 block/blk-mq.c     |  3 ---
 block/blk-mq.h     | 39 ---------------------------------------
 3 files changed, 46 deletions(-)

Comments

Christoph Hellwig Oct. 20, 2023, 4:41 a.m. UTC | #1
On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.

Work is hard, so let's skip it?  That's not really the most convincing
argument.  Hey, I'm the biggest advocate for code improvement by code
removal, but you better have a really good argument why it doesn't hurt
anyone.
Bart Van Assche Oct. 20, 2023, 4:17 p.m. UTC | #2
On 10/19/23 21:41, Christoph Hellwig wrote:
> On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
>> Note: it has been attempted to rework this algorithm. See also "[PATCH
>> RFC 0/7] blk-mq: improve tag fair sharing"
>> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
>> Given the complexity of that patch series, I do not expect that patch
>> series to be merged.
> 
> Work is hard, so let's skip it?  That's not really the most convincing
> argument.  Hey, I'm the biggest advocate for code improvement by code
> removal, but you better have a really good argument why it doesn't hurt
> anyone.

Hi Christoph,

No, it's not because it's hard to improve the tag fairness algorithm
that I'm proposing to skip this work. It's because I'm convinced that
an improved fairness algorithm will have a negative performance impact
that is larger than that of the current algorithm.

Do you agree that the legacy block layer never had any such fairness
algorithm and also that nobody ever complained about fairness issues
for the legacy block layer? I think that's a strong argument in favor of
removing the fairness algorithm.

Thanks,

Bart.
Keith Busch Oct. 20, 2023, 4:25 p.m. UTC | #3
On Fri, Oct 20, 2023 at 09:17:11AM -0700, Bart Van Assche wrote:
> On 10/19/23 21:41, Christoph Hellwig wrote:
> > On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> > > Note: it has been attempted to rework this algorithm. See also "[PATCH
> > > RFC 0/7] blk-mq: improve tag fair sharing"
> > > (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> > > Given the complexity of that patch series, I do not expect that patch
> > > series to be merged.
> > 
> > Work is hard, so let's skip it?  That's not really the most convincing
> > argument.  Hey, I'm the biggest advocate for code improvement by code
> > removal, but you better have a really good argument why it doesn't hurt
> > anyone.
> 
> Hi Christoph,
> 
> No, it's not because it's hard to improve the tag fairness algorithm
> that I'm proposing to skip this work. It's because I'm convinced that
> an improved fairness algorithm will have a negative performance impact
> that is larger than that of the current algorithm.
> 
> Do you agree that the legacy block layer never had any such fairness
> algorithm and also that nobody ever complained about fairness issues
> for the legacy block layer? I think that's a strong argument in favor of
> removing the fairness algorithm.

The legacy block request layer didn't have a tag resource shared among
multiple request queues. Each queue had their own mempool for allocating
requests. The mempool, I think, would always guarantee everyone could
get at least one request.
Bart Van Assche Oct. 20, 2023, 4:45 p.m. UTC | #4
On 10/20/23 09:25, Keith Busch wrote:
> The legacy block request layer didn't have a tag resource shared among
> multiple request queues. Each queue had their own mempool for allocating
> requests. The mempool, I think, would always guarantee everyone could
> get at least one request.

I think that the above is irrelevant in this context. As an example, 
SCSI devices have always shared a pool of tags across multiple logical
units. This behavior has not been changed by the conversion of the
SCSI core from the legacy block layer to blk-mq.

For other (hardware) block devices it didn't matter either that there
was no upper limit to the number of requests the legacy block layer
could allocate. All hardware block devices I know support fixed size
queues for queuing requests to the block device.

Bart.
Keith Busch Oct. 20, 2023, 5:09 p.m. UTC | #5
On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
> 
> On 10/20/23 09:25, Keith Busch wrote:
> > The legacy block request layer didn't have a tag resource shared among
> > multiple request queues. Each queue had their own mempool for allocating
> > requests. The mempool, I think, would always guarantee everyone could
> > get at least one request.
> 
> I think that the above is irrelevant in this context. As an example, SCSI
> devices have always shared a pool of tags across multiple logical
> units. This behavior has not been changed by the conversion of the
> SCSI core from the legacy block layer to blk-mq.
> 
> For other (hardware) block devices it didn't matter either that there
> was no upper limit to the number of requests the legacy block layer
> could allocate. All hardware block devices I know support fixed size
> queues for queuing requests to the block device.

I am not sure I understand your point. Those lower layers always were
able to get at least one request per request_queue. They can do whatever
they want with it after that. This change removes that guarantee for
blk-mq in some cases, right? I just don't think you can readily conclude
that is "safe" by appealing to the legacy behavior, that's all.
Bart Van Assche Oct. 20, 2023, 5:54 p.m. UTC | #6
On 10/20/23 10:09, Keith Busch wrote:
> On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
>>
>> On 10/20/23 09:25, Keith Busch wrote:
>>> The legacy block request layer didn't have a tag resource shared among
>>> multiple request queues. Each queue had their own mempool for allocating
>>> requests. The mempool, I think, would always guarantee everyone could
>>> get at least one request.
>>
>> I think that the above is irrelevant in this context. As an example, SCSI
>> devices have always shared a pool of tags across multiple logical
>> units. This behavior has not been changed by the conversion of the
>> SCSI core from the legacy block layer to blk-mq.
>>
>> For other (hardware) block devices it didn't matter either that there
>> was no upper limit to the number of requests the legacy block layer
>> could allocate. All hardware block devices I know support fixed size
>> queues for queuing requests to the block device.
> 
> I am not sure I understand your point. Those lower layers always were
> able to get at least one request per request_queue. They can do whatever
> they want with it after that. This change removes that guarantee for
> blk-mq in some cases, right? I just don't think you can readily conclude
> that is "safe" by appealing to the legacy behavior, that's all.

Hi Keith,

How requests were allocated in the legacy block layer is irrelevant in
this context. The patch I posted affects the tag allocation strategy.
Tag allocation happened in the legacy block layer by calling
blk_queue_start_tag(). From Linux kernel v4.20:

/**
  * blk_queue_start_tag - find a free tag and assign it
  * @q:  the request queue for the device
  * @rq:  the block request that needs tagging
  * [ ... ]
  **/

That function supports sharing tags between request queues but did not
attempt to be fair at all. This is how the SCSI core in Linux kernel 
v4.20 sets up tag sharing between request queues (from 
drivers/scsi/scsi_scan.c):

	if (!shost_use_blk_mq(sdev->host)) {
		blk_queue_init_tags(sdev->request_queue,
				    sdev->host->cmd_per_lun, shost->bqt,
				    shost->hostt->tag_alloc_policy);
	}

blk-mq has always had a fairness algorithm for the case where a tag set
is shared across request queues: the number of tags per request queue is
restricted to the total number of tags divided by the number of active
users (ignoring rounding). From block/blk-mq.c in the latest kernel:

	depth = max((bt->sb.depth + users - 1) / users, 4U);

What my patch does is to remove this fairness guarantee. There was no
equivalent of this fairness guarantee in the legacy block layer.
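
To make the arithmetic concrete, here is a small stand-alone sketch of
the depth computation quoted above (plain C outside the kernel; the
helper name is made up for illustration, and the 31-tag tag set is the
UFS example from the commit message):

	#include <stdio.h>

	/* Fair share per queue, as computed in hctx_may_queue():
	 * ceil(total_tags / users), but never less than 4. */
	static unsigned int fair_share_depth(unsigned int total_tags,
					     unsigned int users)
	{
		unsigned int depth = (total_tags + users - 1) / users;

		return depth > 4 ? depth : 4;
	}

	int main(void)
	{
		printf("%u\n", fair_share_depth(31, 1));	/* 31 */
		printf("%u\n", fair_share_depth(31, 2));	/* 16 */
		printf("%u\n", fair_share_depth(31, 4));	/*  8 */
		return 0;
	}

As soon as a second request queue (e.g. the WLUN) becomes active, the
per-queue share drops to roughly half of the tag set, which is the
regression described in the commit message.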

Thanks,

Bart.
Bart Van Assche Oct. 20, 2023, 7:11 p.m. UTC | #7
On 10/18/23 11:00, Bart Van Assche wrote:
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> -				  struct sbitmap_queue *bt)
> -{
> -	unsigned int depth, users;
> -
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> -		return true;
> -
> -	/*
> -	 * Don't try dividing an ant
> -	 */
> -	if (bt->sb.depth == 1)
> -		return true;
> -
> -	if (blk_mq_is_shared_tags(hctx->flags)) {
> -		struct request_queue *q = hctx->queue;
> -
> -		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> -			return true;
> -	} else {
> -		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -			return true;
> -	}
> -
> -	users = READ_ONCE(hctx->tags->active_queues);
> -	if (!users)
> -		return true;
> -
> -	/*
> -	 * Allow at least some tags
> -	 */
> -	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return __blk_mq_active_requests(hctx) < depth;
> -}

(replying to my own email)

Hi Jens,

Do you perhaps remember why the above code was introduced by commit
0d2602ca30e4 ("blk-mq: improve support for shared tags maps") in 2014
(nine years ago)?

Thanks,

Bart.
Ming Lei Oct. 21, 2023, 1:31 a.m. UTC | #8
On Fri, Oct 20, 2023 at 10:54:59AM -0700, Bart Van Assche wrote:
> 
> On 10/20/23 10:09, Keith Busch wrote:
> > On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
> > > 
> > > On 10/20/23 09:25, Keith Busch wrote:
> > > > The legacy block request layer didn't have a tag resource shared among
> > > > multiple request queues. Each queue had their own mempool for allocating
> > > > requests. The mempool, I think, would always guarantee everyone could
> > > > get at least one request.
> > > 
> > > I think that the above is irrelevant in this context. As an example, SCSI
> > > devices have always shared a pool of tags across multiple logical
> > > units. This behavior has not been changed by the conversion of the
> > > SCSI core from the legacy block layer to blk-mq.
> > > 
> > > For other (hardware) block devices it didn't matter either that there
> > > was no upper limit to the number of requests the legacy block layer
> > > could allocate. All hardware block devices I know support fixed size
> > > queues for queuing requests to the block device.
> > 
> > I am not sure I understand your point. Those lower layers always were
> > able to get at least one request per request_queue. They can do whatever
> > they want with it after that. This change removes that guarantee for
> > blk-mq in some cases, right? I just don't think you can readily conclude
> > that is "safe" by appealing to the legacy behavior, that's all.
> 
> Hi Keith,
> 
> How requests were allocated in the legacy block layer is irrelevant in
> this context. The patch I posted affects the tag allocation strategy.
> Tag allocation happened in the legacy block layer by calling
> blk_queue_start_tag(). From Linux kernel v4.20:
> 
> /**
>  * blk_queue_start_tag - find a free tag and assign it
>  * @q:  the request queue for the device
>  * @rq:  the block request that needs tagging
>  * [ ... ]
>  **/
> 
> That function supports sharing tags between request queues but did not
> attempt to be fair at all. This is how the SCSI core in Linux kernel v4.20
> sets up tag sharing between request queues (from drivers/scsi/scsi_scan.c):
> 
> 	if (!shost_use_blk_mq(sdev->host)) {
> 		blk_queue_init_tags(sdev->request_queue,
> 				    sdev->host->cmd_per_lun, shost->bqt,
> 				    shost->hostt->tag_alloc_policy);
> 	}
> 
> blk-mq has always had a fairness algorithm for the case where a tag set
> is shared across request queues: the number of tags per request queue is
> restricted to the total number of tags divided by the number of active
> users (ignoring rounding). From block/blk-mq.c in the latest kernel:
> 
> 	depth = max((bt->sb.depth + users - 1) / users, 4U);
> 
> What my patch does is to remove this fairness guarantee. There was no
> equivalent of this fairness guarantee in the legacy block layer.

If two LUNs are attached to the same host, one slow and the other fast,
the slow LUN can easily slow down the fast LUN without this fairness
algorithm.

Your motivation is that "One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs." I guess one simple fix is to not account queues
of this non-IO LUN as active queues?



Thanks, 
Ming
Yu Kuai Oct. 21, 2023, 7:32 a.m. UTC | #9
Hi,

On 2023/10/19 2:00, Bart Van Assche wrote:
> Remove the code for fair tag sharing because it significantly hurts
> performance for UFS devices. Removing this code is safe because the
> legacy block layer worked fine without any equivalent fairness
> algorithm.
> 
> This algorithm hurts performance for UFS devices because UFS devices
> have multiple logical units. One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs.
> 
> See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/
> 
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.

Sorry for such a huge delay; I struggled with implementing a smooth
algorithm to borrow tags and return borrowed tags, and later I put this
on ice and focused on other stuff.

I had an idea to implement a state machine; however, the amount of code
was excessive and I gave up. Later I implemented a simple version and
tested it in your case, 32 tags and 2 shared nodes. The results look
good (see below); however, I'm not confident this will work well in
general.

Anyway, I'll send a new RFC version for this; please let me know if you
still think this approach is unacceptable.

Thanks,
Kuai

Test script:

[global]
ioengine=libaio
iodepth=2
bs=4k
direct=1
rw=randrw
group_reporting

[sda]
numjobs=32
filename=/dev/sda

[sdb]
numjobs=1
filename=/dev/sdb

Test result, obtained by monitoring the new debugfs entry
shared_tag_info:

time    active          available
        sda     sdb     sda     sdb
0       0       0       32      32
1       16      2       16      16      -> start fair sharing
2       19      2       20      16
3       24      2       24      16
4       26      2       28      16      -> borrow 32/8=4 tags each round
5       28      2       28      16      -> save at least 4 tags for sdb
...
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Yu Kuai <yukuai1@huaweicloud.com>
> Cc: Ed Tsai <ed.tsai@mediatek.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-mq-tag.c |  4 ----
>   block/blk-mq.c     |  3 ---
>   block/blk-mq.h     | 39 ---------------------------------------
>   3 files changed, 46 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index cc57e2dd9a0b..25334bfcabf8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>   			    struct sbitmap_queue *bt)
>   {
> -	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
> -			!hctx_may_queue(data->hctx, bt))
> -		return BLK_MQ_NO_TAG;
> -
>   	if (data->shallow_depth)
>   		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
>   	else
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e2d11183f62e..502dafa76716 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1760,9 +1760,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
>   	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
>   		bt = &rq->mq_hctx->tags->breserved_tags;
>   		tag_offset = 0;
> -	} else {
> -		if (!hctx_may_queue(rq->mq_hctx, bt))
> -			return false;
>   	}
>   
>   	tag = __sbitmap_queue_get(bt);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..14a22f6d3fdf 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -407,45 +407,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
>   	}
>   }
>   
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> -				  struct sbitmap_queue *bt)
> -{
> -	unsigned int depth, users;
> -
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> -		return true;
> -
> -	/*
> -	 * Don't try dividing an ant
> -	 */
> -	if (bt->sb.depth == 1)
> -		return true;
> -
> -	if (blk_mq_is_shared_tags(hctx->flags)) {
> -		struct request_queue *q = hctx->queue;
> -
> -		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> -			return true;
> -	} else {
> -		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -			return true;
> -	}
> -
> -	users = READ_ONCE(hctx->tags->active_queues);
> -	if (!users)
> -		return true;
> -
> -	/*
> -	 * Allow at least some tags
> -	 */
> -	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return __blk_mq_active_requests(hctx) < depth;
> -}
> -
>   /* run the code block in @dispatch_ops with rcu/srcu read lock held */
>   #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
>   do {								\
> 
> .
>
Bart Van Assche Oct. 21, 2023, 4:13 p.m. UTC | #10
On 10/20/23 18:31, Ming Lei wrote:
> If two LUNs are attached to the same host, one slow and the other fast,
> the slow LUN can easily slow down the fast LUN without this fairness
> algorithm.
> 
> Your motivation is that "One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs." I guess one simple fix is to not account queues
> of this non-IO LUN as active queues?

Hi Ming,

For fast storage devices (e.g. UFS), any time spent in an algorithm for
fair sharing reduces IOPS. Fair sharing is only beneficial if there are
big differences in request processing latency between different request
queues. Whether or not the fair sharing algorithm is improved, how about
making it easy to disable fair sharing, e.g. with something like the
untested patch below? I think that will work better than ignoring fair
sharing per LUN. UFS devices support multiple logical units, and with
the current fair sharing approach it takes a long time before tags are
taken away from an inactive LUN (request queue timeout).

Thanks,

Bart.


diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..b06b161d06de 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
  {
  	unsigned int depth, users;

-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    hctx->queue->disable_fair_sharing)
  		return true;

  	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..63b04cf65887 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -523,6 +523,7 @@ struct request_queue {
  	struct mutex		debugfs_mutex;

  	bool			mq_sysfs_init_done;
+	bool			disable_fair_sharing;
  };

  /* Keep blk_queue_flag_name[] in sync with the definitions below */
Bart Van Assche Oct. 21, 2023, 4:21 p.m. UTC | #11
On 10/21/23 00:32, Yu Kuai wrote:
> Sorry for such a huge delay; I struggled with implementing a smooth
> algorithm to borrow tags and return borrowed tags, and later I put this
> on ice and focused on other stuff.
> 
> I had an idea to implement a state machine; however, the amount of code
> was excessive and I gave up. Later I implemented a simple version and
> tested it in your case, 32 tags and 2 shared nodes. The results look
> good (see below); however, I'm not confident this will work well in
> general.
> 
> Anyway, I'll send a new RFC version for this; please let me know if you
> still think this approach is unacceptable.
> 
> Thanks,
> Kuai
> 
> Test script:
> 
> [global]
> ioengine=libaio
> iodepth=2
> bs=4k
> direct=1
> rw=randrw
> group_reporting
> 
> [sda]
> numjobs=32
> filename=/dev/sda
> 
> [sdb]
> numjobs=1
> filename=/dev/sdb
> 
> Test result, obtained by monitoring the new debugfs entry
> shared_tag_info:
> 
> time    active          available
>         sda     sdb     sda     sdb
> 0       0       0       32      32
> 1       16      2       16      16      -> start fair sharing
> 2       19      2       20      16
> 3       24      2       24      16
> 4       26      2       28      16      -> borrow 32/8=4 tags each round
> 5       28      2       28      16      -> save at least 4 tags for sdb

Hi Yu,

Thank you for sharing these results. What is the unit of the numbers in
the time column?

In the above I see that more tags are assigned to sda than to sdb
although I/O is being submitted to both LUNs. I think the current
algorithm defines fairness as dividing tags in a fair way across active
LUNs. Do the above results show that tags are divided per active job
instead of per active LUN? If so, I'm not sure that everyone will agree
that this is a fair way to distribute tags ...

Thanks,

Bart.
Yu Kuai Oct. 23, 2023, 1:11 a.m. UTC | #12
Hi,

On 2023/10/22 0:21, Bart Van Assche wrote:
> 
> On 10/21/23 00:32, Yu Kuai wrote:
>> Sorry for such a huge delay; I struggled with implementing a smooth
>> algorithm to borrow tags and return borrowed tags, and later I put this
>> on ice and focused on other stuff.
>>
>> I had an idea to implement a state machine; however, the amount of code
>> was excessive and I gave up. Later I implemented a simple version and
>> tested it in your case, 32 tags and 2 shared nodes. The results look
>> good (see below); however, I'm not confident this will work well in
>> general.
>>
>> Anyway, I'll send a new RFC version for this; please let me know if you
>> still think this approach is unacceptable.
>>
>> Thanks,
>> Kuai
>>
>> Test script:
>>
>> [global]
>> ioengine=libaio
>> iodepth=2
>> bs=4k
>> direct=1
>> rw=randrw
>> group_reporting
>>
>> [sda]
>> numjobs=32
>> filename=/dev/sda
>>
>> [sdb]
>> numjobs=1
>> filename=/dev/sdb
>>
>> Test result, obtained by monitoring the new debugfs entry
>> shared_tag_info:
>>
>> time    active          available
>>         sda     sdb     sda     sdb
>> 0       0       0       32      32
>> 1       16      2       16      16      -> start fair sharing
>> 2       19      2       20      16
>> 3       24      2       24      16
>> 4       26      2       28      16      -> borrow 32/8=4 tags each round
>> 5       28      2       28      16      -> save at least 4 tags for sdb
> 
> Hi Yu,
> 
> Thank you for sharing these results. What is the unit of the numbers in
> the time column?


I added a timer to blk_mq_tags; its timer function implements borrowing
tags and returning borrowed tags. The timer starts when one node becomes
busy and expires after HZ jiffies, so each step in the time column is
one second.
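
Roughly like the following simplified sketch (all names here are just
illustrations, not the exact code from the RFC):

	#include <linux/timer.h>

	struct shared_tag_info {
		struct timer_list timer;
		/* per-node active/available counters would live here */
	};

	static void tag_share_timer_fn(struct timer_list *t)
	{
		struct shared_tag_info *info = from_timer(info, t, timer);

		/*
		 * Each round, move up to depth/8 tags from nodes with
		 * spare capacity to the busy node, while keeping a
		 * minimum share for every other active node.
		 */

		/* re-arm while any node is still busy */
		mod_timer(&info->timer, jiffies + HZ);
	}

	static void tag_share_start(struct shared_tag_info *info)
	{
		/* called when a node first becomes busy */
		timer_setup(&info->timer, tag_share_timer_fn, 0);
		mod_timer(&info->timer, jiffies + HZ);
	}
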
> 
> In the above I see that more tags are assigned to sda than to sdb
> although I/O is being submitted to both LUNs. I think the current
> algorithm defines fairness as dividing tags in a fair way across active
> LUNs. Do the above results show that tags are divided per active job
> instead of per active LUN? If so, I'm not sure that everyone will agree
> that this is a fair way to distribute tags ...

Yes, the tags are divided per active LUN, specifically per
request_queue or hctx that is sharing tags.

Thanks,
Kuai

> 
> Thanks,
> 
> Bart.
> 
> 
> .
>
Ming Lei Oct. 23, 2023, 3:44 a.m. UTC | #13
On Sat, Oct 21, 2023 at 09:13:38AM -0700, Bart Van Assche wrote:
> On 10/20/23 18:31, Ming Lei wrote:
> > If two LUNs are attached to the same host, one slow and the other fast,
> > the slow LUN can easily slow down the fast LUN without this fairness
> > algorithm.
> > 
> > Your motivation is that "One of these logical units (WLUN) is used
> > to submit control commands, e.g. START STOP UNIT. If any request is
> > submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> > lower for data LUNs." I guess one simple fix is to not account queues
> > of this non-IO LUN as active queues?
> 
> Hi Ming,
> 
> For fast storage devices (e.g. UFS), any time spent in an algorithm for
> fair sharing reduces IOPS. Fair sharing is only beneficial if there are
> big differences in request processing latency between different request
> queues. Whether or not the fair sharing algorithm is improved, how about
> making it easy to disable fair sharing, e.g. with something like the
> untested patch below? I think that will work better than ignoring fair
> sharing per LUN. UFS devices support multiple logical units, and with
> the current fair sharing approach it takes a long time before tags are
> taken away from an inactive LUN (request queue timeout).
> 
> Thanks,
> 
> Bart.
> 
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..b06b161d06de 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>  {
>  	unsigned int depth, users;
> 
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> +	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
> +	    hctx->queue->disable_fair_sharing)

Maybe you can propagate this flag into hctx->flags; then
hctx->queue->disable_fair_sharing can be avoided in the fast path.

>  		return true;
> 
>  	/*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index eef450f25982..63b04cf65887 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -523,6 +523,7 @@ struct request_queue {
>  	struct mutex		debugfs_mutex;
> 
>  	bool			mq_sysfs_init_done;
> +	bool			disable_fair_sharing;

You also need to bypass blk_mq_tag_busy() & blk_mq_tag_idle() in case
of disable_fair_sharing, which should only be set for non-I/O queues,
such as the UFS WLUN, and maybe for NVMe connection queues.
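
Something like this untested sketch (BLK_MQ_F_NO_FAIR_SHARING is a
made-up flag bit used only for illustration; disable_fair_sharing is
the field from your patch above):

	/* at hctx init (e.g. in blk_mq_init_hctx()), propagate the
	 * queue-level opt-out into the hctx flags: */
	static void propagate_fair_sharing_opt_out(struct blk_mq_hw_ctx *hctx)
	{
		if (hctx->queue->disable_fair_sharing)
			hctx->flags |= BLK_MQ_F_NO_FAIR_SHARING;
	}

	/* then skip the active-queue accounting in the fast path: */
	static inline void blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
	{
		if (hctx->flags & BLK_MQ_F_NO_FAIR_SHARING)
			return;
		if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
			__blk_mq_tag_busy(hctx);
	}

	static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
	{
		if (hctx->flags & BLK_MQ_F_NO_FAIR_SHARING)
			return;
		if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
			__blk_mq_tag_idle(hctx);
	}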


Thanks, 
Ming

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0b..25334bfcabf8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -105,10 +105,6 @@  void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
-	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-			!hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
-
 	if (data->shallow_depth)
 		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..502dafa76716 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1760,9 +1760,6 @@  bool __blk_mq_alloc_driver_tag(struct request *rq)
 	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
 		bt = &rq->mq_hctx->tags->breserved_tags;
 		tag_offset = 0;
-	} else {
-		if (!hctx_may_queue(rq->mq_hctx, bt))
-			return false;
 	}
 
 	tag = __sbitmap_queue_get(bt);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..14a22f6d3fdf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -407,45 +407,6 @@  static inline void blk_mq_free_requests(struct list_head *list)
 	}
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-				  struct sbitmap_queue *bt)
-{
-	unsigned int depth, users;
-
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return true;
-
-	/*
-	 * Don't try dividing an ant
-	 */
-	if (bt->sb.depth == 1)
-		return true;
-
-	if (blk_mq_is_shared_tags(hctx->flags)) {
-		struct request_queue *q = hctx->queue;
-
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			return true;
-	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			return true;
-	}
-
-	users = READ_ONCE(hctx->tags->active_queues);
-	if (!users)
-		return true;
-
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\