Message ID: 20231018180056.2151711-1-bvanassche@acm.org (mailing list archive)
State: New, archived
Series: block: Improve shared tag set performance
On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.

Work is hard, so let's skip it? That's not really the most convincing
argument. Hey, I'm the biggest advocate for code improvement by code
removal, but you better have a really good argument why it doesn't hurt
anyone.
On 10/19/23 21:41, Christoph Hellwig wrote:
> On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
>> Note: it has been attempted to rework this algorithm. See also "[PATCH
>> RFC 0/7] blk-mq: improve tag fair sharing"
>> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
>> Given the complexity of that patch series, I do not expect that patch
>> series to be merged.
>
> Work is hard, so let's skip it? That's not really the most convincing
> argument. Hey, I'm the biggest advocate for code improvement by code
> removal, but you better have a really good argument why it doesn't hurt
> anyone.

Hi Christoph,

No, it's not because it's hard to improve the tag fairness algorithm
that I'm proposing to skip this work. It's because I'm convinced that
an improved fairness algorithm will have a negative performance impact
that is larger than that of the current algorithm.

Do you agree that the legacy block layer never had any such fairness
algorithm and also that nobody ever complained about fairness issues
for the legacy block layer? I think that's a strong argument in favor of
removing the fairness algorithm.

Thanks,

Bart.
On Fri, Oct 20, 2023 at 09:17:11AM -0700, Bart Van Assche wrote:
> On 10/19/23 21:41, Christoph Hellwig wrote:
> > On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> > > Note: it has been attempted to rework this algorithm. See also "[PATCH
> > > RFC 0/7] blk-mq: improve tag fair sharing"
> > > (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> > > Given the complexity of that patch series, I do not expect that patch
> > > series to be merged.
> >
> > Work is hard, so let's skip it? That's not really the most convincing
> > argument. Hey, I'm the biggest advocate for code improvement by code
> > removal, but you better have a really good argument why it doesn't hurt
> > anyone.
>
> Hi Christoph,
>
> No, it's not because it's hard to improve the tag fairness algorithm
> that I'm proposing to skip this work. It's because I'm convinced that
> an improved fairness algorithm will have a negative performance impact
> that is larger than that of the current algorithm.
>
> Do you agree that the legacy block layer never had any such fairness
> algorithm and also that nobody ever complained about fairness issues
> for the legacy block layer? I think that's a strong argument in favor of
> removing the fairness algorithm.

The legacy block request layer didn't have a tag resource shared among
multiple request queues. Each queue had their own mempool for allocating
requests. The mempool, I think, would always guarantee everyone could
get at least one request.
On 10/20/23 09:25, Keith Busch wrote:
> The legacy block request layer didn't have a tag resource shared among
> multiple request queues. Each queue had their own mempool for allocating
> requests. The mempool, I think, would always guarantee everyone could
> get at least one request.

I think that the above is irrelevant in this context. As an example, SCSI
devices have always shared a pool of tags across multiple logical
units. This behavior has not been changed by the conversion of the
SCSI core from the legacy block layer to blk-mq.

For other (hardware) block devices it didn't matter either that there
was no upper limit to the number of requests the legacy block layer
could allocate. All hardware block devices I know support fixed size
queues for queuing requests to the block device.

Bart.
On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
>
> On 10/20/23 09:25, Keith Busch wrote:
> > The legacy block request layer didn't have a tag resource shared among
> > multiple request queues. Each queue had their own mempool for allocating
> > requests. The mempool, I think, would always guarantee everyone could
> > get at least one request.
>
> I think that the above is irrelevant in this context. As an example, SCSI
> devices have always shared a pool of tags across multiple logical
> units. This behavior has not been changed by the conversion of the
> SCSI core from the legacy block layer to blk-mq.
>
> For other (hardware) block devices it didn't matter either that there
> was no upper limit to the number of requests the legacy block layer
> could allocate. All hardware block devices I know support fixed size
> queues for queuing requests to the block device.

I am not sure I understand your point. Those lower layers always were
able to get at least one request per request_queue. They can do whatever
they want with it after that. This change removes that guarantee for
blk-mq in some cases, right? I just don't think you can readily conclude
that is "safe" by appealing to the legacy behavior, that's all.
On 10/20/23 10:09, Keith Busch wrote:
> On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
>>
>> On 10/20/23 09:25, Keith Busch wrote:
>>> The legacy block request layer didn't have a tag resource shared among
>>> multiple request queues. Each queue had their own mempool for allocating
>>> requests. The mempool, I think, would always guarantee everyone could
>>> get at least one request.
>>
>> I think that the above is irrelevant in this context. As an example, SCSI
>> devices have always shared a pool of tags across multiple logical
>> units. This behavior has not been changed by the conversion of the
>> SCSI core from the legacy block layer to blk-mq.
>>
>> For other (hardware) block devices it didn't matter either that there
>> was no upper limit to the number of requests the legacy block layer
>> could allocate. All hardware block devices I know support fixed size
>> queues for queuing requests to the block device.
>
> I am not sure I understand your point. Those lower layers always were
> able to get at least one request per request_queue. They can do whatever
> they want with it after that. This change removes that guarantee for
> blk-mq in some cases, right? I just don't think you can readily conclude
> that is "safe" by appealing to the legacy behavior, that's all.

Hi Keith,

How requests were allocated in the legacy block layer is irrelevant in
this context. The patch I posted affects the tag allocation strategy.
Tag allocation happened in the legacy block layer by calling
blk_queue_start_tag(). From Linux kernel v4.20:

/**
 * blk_queue_start_tag - find a free tag and assign it
 * @q: the request queue for the device
 * @rq: the block request that needs tagging
 * [ ... ]
 **/

That function supports sharing tags between request queues but did not
attempt to be fair at all. This is how the SCSI core in Linux kernel
v4.20 sets up tag sharing between request queues (from
drivers/scsi/scsi_scan.c):

	if (!shost_use_blk_mq(sdev->host)) {
		blk_queue_init_tags(sdev->request_queue,
				    sdev->host->cmd_per_lun, shost->bqt,
				    shost->hostt->tag_alloc_policy);
	}

blk-mq has always had a fairness algorithm in case a tag set is shared
across request queues. If a tag set is shared across request queues, the
number of tags per request queue is restricted to the total number of
tags divided by the number of users (ignoring rounding). From
block/blk-mq.h in the latest kernel:

	depth = max((bt->sb.depth + users - 1) / users, 4U);

What my patch does is to remove this fairness guarantee. There was no
equivalent of this fairness guarantee in the legacy block layer.

Thanks,

Bart.
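The arithmetic of that fair-share bound is easy to check in isolation.
Below is a minimal userspace C sketch (illustrative only, not kernel
code) that evaluates the formula for the 31-tag UFS tag set that the
patch description mentions:

#include <stdio.h>

/*
 * Userspace restatement of the fair-share bound from hctx_may_queue():
 * depth = max((bt->sb.depth + users - 1) / users, 4U).
 * Illustrative arithmetic only; not kernel code.
 */
static unsigned int fair_share_depth(unsigned int sb_depth, unsigned int users)
{
	unsigned int depth = (sb_depth + users - 1) / users; /* round up */

	return depth > 4 ? depth : 4; /* "allow at least some tags" */
}

int main(void)
{
	unsigned int users;

	/* A UFS host exposes 31 usable tags that all LUNs share. */
	for (users = 1; users <= 4; users++)
		printf("users=%u -> per-queue depth=%u\n",
		       users, fair_share_depth(31, users));
	return 0;
}

With two active queues the per-queue limit drops from 31 to 16, then to
11 and 8 with three and four active queues — the collapse in effective
queue depth that the patch description attributes to an active WLUN.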
On 10/18/23 11:00, Bart Van Assche wrote:
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> -				  struct sbitmap_queue *bt)
> -{
> -	unsigned int depth, users;
> -
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> -		return true;
> -
> -	/*
> -	 * Don't try dividing an ant
> -	 */
> -	if (bt->sb.depth == 1)
> -		return true;
> -
> -	if (blk_mq_is_shared_tags(hctx->flags)) {
> -		struct request_queue *q = hctx->queue;
> -
> -		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> -			return true;
> -	} else {
> -		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -			return true;
> -	}
> -
> -	users = READ_ONCE(hctx->tags->active_queues);
> -	if (!users)
> -		return true;
> -
> -	/*
> -	 * Allow at least some tags
> -	 */
> -	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return __blk_mq_active_requests(hctx) < depth;
> -}

(replying to my own email)

Hi Jens,

Do you perhaps remember why the above code was introduced by commit
0d2602ca30e4 ("blk-mq: improve support for shared tags maps") in 2014
(nine years ago)?

Thanks,

Bart.
On Fri, Oct 20, 2023 at 10:54:59AM -0700, Bart Van Assche wrote:
>
> On 10/20/23 10:09, Keith Busch wrote:
> > On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
> > >
> > > On 10/20/23 09:25, Keith Busch wrote:
> > > > The legacy block request layer didn't have a tag resource shared among
> > > > multiple request queues. Each queue had their own mempool for allocating
> > > > requests. The mempool, I think, would always guarantee everyone could
> > > > get at least one request.
> > >
> > > I think that the above is irrelevant in this context. As an example, SCSI
> > > devices have always shared a pool of tags across multiple logical
> > > units. This behavior has not been changed by the conversion of the
> > > SCSI core from the legacy block layer to blk-mq.
> > >
> > > For other (hardware) block devices it didn't matter either that there
> > > was no upper limit to the number of requests the legacy block layer
> > > could allocate. All hardware block devices I know support fixed size
> > > queues for queuing requests to the block device.
> >
> > I am not sure I understand your point. Those lower layers always were
> > able to get at least one request per request_queue. They can do whatever
> > they want with it after that. This change removes that guarantee for
> > blk-mq in some cases, right? I just don't think you can readily conclude
> > that is "safe" by appealing to the legacy behavior, that's all.
>
> Hi Keith,
>
> How requests were allocated in the legacy block layer is irrelevant in
> this context. The patch I posted affects the tag allocation strategy.
> Tag allocation happened in the legacy block layer by calling
> blk_queue_start_tag(). From Linux kernel v4.20:
>
> /**
>  * blk_queue_start_tag - find a free tag and assign it
>  * @q: the request queue for the device
>  * @rq: the block request that needs tagging
>  * [ ... ]
>  **/
>
> That function supports sharing tags between request queues but did not
> attempt to be fair at all. This is how the SCSI core in Linux kernel
> v4.20 sets up tag sharing between request queues (from
> drivers/scsi/scsi_scan.c):
>
> 	if (!shost_use_blk_mq(sdev->host)) {
> 		blk_queue_init_tags(sdev->request_queue,
> 				    sdev->host->cmd_per_lun, shost->bqt,
> 				    shost->hostt->tag_alloc_policy);
> 	}
>
> blk-mq has always had a fairness algorithm in case a tag set is shared
> across request queues. If a tag set is shared across request queues, the
> number of tags per request queue is restricted to the total number of
> tags divided by the number of users (ignoring rounding). From
> block/blk-mq.h in the latest kernel:
>
> 	depth = max((bt->sb.depth + users - 1) / users, 4U);
>
> What my patch does is to remove this fairness guarantee. There was no
> equivalent of this fairness guarantee in the legacy block layer.

If two LUNs are attached to the same host, one slow and another fast,
the slow LUN can easily slow down the fast LUN without this fairness
algorithm.

Your motivation is that "One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs." I guess one simple fix is to not account the
queues of this non-IO LUN as active queues?

Thanks,
Ming
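For readers trying to picture Ming's suggestion, a rough sketch follows.
BLK_MQ_F_NO_FAIR_SHARE is an invented flag name and the helper shape is
simplified; only __blk_mq_tag_busy() and the active-queue accounting it
performs exist in the kernel as described.

/*
 * Rough sketch of Ming's idea, not actual kernel code: a queue that
 * only carries control commands (e.g. the UFS WLUN) opts out of the
 * active-queue accounting, so it never shrinks the fair share that
 * hctx_may_queue() grants to the data LUNs.
 */
static void blk_mq_tag_busy_sketch(struct blk_mq_hw_ctx *hctx)
{
	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
		return;

	/* A control-only queue does not count as an active tag user. */
	if (hctx->flags & BLK_MQ_F_NO_FAIR_SHARE)
		return;

	/* Existing helper that increments tags->active_queues. */
	__blk_mq_tag_busy(hctx);
}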
Hi,

On 2023/10/19 02:00, Bart Van Assche wrote:
> Remove the code for fair tag sharing because it significantly hurts
> performance for UFS devices. Removing this code is safe because the
> legacy block layer worked fine without any equivalent fairness
> algorithm.
>
> This algorithm hurts performance for UFS devices because UFS devices
> have multiple logical units. One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs.
>
> See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/
>
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.

Sorry for such a huge delay. I struggled with implementing a smooth
algorithm to borrow tags and return borrowed tags, and later I put this
on ice and focused on other stuff.

I had an idea to implement a state machine, however, the amount of code
was aggressive and I gave up. Later I implemented a simple version and
tested it in your case, 32 tags and 2 shared nodes; the results look
good (see below). However, I'm not confident that this can work well in
general.

Anyway, I'll send a new RFC version for this, and please let me know if
you still think this approach is unacceptable.

Thanks,
Kuai

Test script:

[global]
ioengine=libaio
iodepth=2
bs=4k
direct=1
rw=randrw
group_reporting

[sda]
numjobs=32
filename=/dev/sda

[sdb]
numjobs=1
filename=/dev/sdb

Test result, by monitoring the new debugfs entry shared_tag_info:

time	active		available
	sda	sdb	sda	sdb
0	0	0	32	32
1	16	2	16	16	-> start fair sharing
2	19	2	20	16
3	24	2	24	16
4	26	2	28	16	-> borrow 32/8=4 tags each round
5	28	2	28	16	-> save at least 4 tags for sdb

>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Yu Kuai <yukuai1@huaweicloud.com>
> Cc: Ed Tsai <ed.tsai@mediatek.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-mq-tag.c |  4 ----
>  block/blk-mq.c     |  3 ---
>  block/blk-mq.h     | 39 ---------------------------------------
>  3 files changed, 46 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index cc57e2dd9a0b..25334bfcabf8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>  static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>  			    struct sbitmap_queue *bt)
>  {
> -	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
> -	    !hctx_may_queue(data->hctx, bt))
> -		return BLK_MQ_NO_TAG;
> -
>  	if (data->shallow_depth)
>  		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
>  	else
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e2d11183f62e..502dafa76716 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1760,9 +1760,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
>  	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
>  		bt = &rq->mq_hctx->tags->breserved_tags;
>  		tag_offset = 0;
> -	} else {
> -		if (!hctx_may_queue(rq->mq_hctx, bt))
> -			return false;
>  	}
>
>  	tag = __sbitmap_queue_get(bt);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..14a22f6d3fdf 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -407,45 +407,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
>  	}
>  }
>
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> -				  struct sbitmap_queue *bt)
> -{
> -	unsigned int depth, users;
> -
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> -		return true;
> -
> -	/*
> -	 * Don't try dividing an ant
> -	 */
> -	if (bt->sb.depth == 1)
> -		return true;
> -
> -	if (blk_mq_is_shared_tags(hctx->flags)) {
> -		struct request_queue *q = hctx->queue;
> -
> -		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> -			return true;
> -	} else {
> -		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -			return true;
> -	}
> -
> -	users = READ_ONCE(hctx->tags->active_queues);
> -	if (!users)
> -		return true;
> -
> -	/*
> -	 * Allow at least some tags
> -	 */
> -	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return __blk_mq_active_requests(hctx) < depth;
> -}
> -
>  /* run the code block in @dispatch_ops with rcu/srcu read lock held */
>  #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
>  do {								\
On 10/20/23 18:31, Ming Lei wrote:
> If two LUNs are attached to the same host, one slow and another fast,
> the slow LUN can easily slow down the fast LUN without this fairness
> algorithm.
>
> Your motivation is that "One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs." I guess one simple fix is to not account the
> queues of this non-IO LUN as active queues?

Hi Ming,

For fast storage devices (e.g. UFS) any time spent in an algorithm for
fair sharing will reduce IOPS. If there are big differences in the
request processing latency between different request queues then fair
sharing is beneficial. Whether or not the fair sharing algorithm is
improved, how about making it easy to disable fair sharing, e.g. with
something like the untested patch below? I think that will work better
than ignoring fair sharing per LUN. UFS devices support multiple logical
units and with the current fair sharing approach it takes a long time
before tags are taken away from an inactive LUN (request queue timeout).

Thanks,

Bart.


diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..b06b161d06de 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    hctx->queue->disable_fair_sharing)
 		return true;
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..63b04cf65887 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -523,6 +523,7 @@ struct request_queue {
 	struct mutex		debugfs_mutex;
 
 	bool			mq_sysfs_init_done;
+	bool			disable_fair_sharing;
 };
 
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
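If a knob along these lines were merged, the driver-side change would be
small. The snippet below is a hypothetical illustration of how a SCSI
LLD such as the UFS driver could set the proposed field from a
device-configuration hook; the hook name is an assumption, and only
sdev->request_queue is an existing member.

/*
 * Hypothetical driver-side use of the proposed disable_fair_sharing
 * field. Neither this hook body nor the field exists upstream; it only
 * illustrates how a driver could opt its LUNs out of fair tag sharing
 * at configuration time.
 */
static int example_ufs_device_configure(struct scsi_device *sdev)
{
	sdev->request_queue->disable_fair_sharing = true;
	return 0;
}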
On 10/21/23 00:32, Yu Kuai wrote:
> Sorry for such a huge delay. I struggled with implementing a smooth
> algorithm to borrow tags and return borrowed tags, and later I put this
> on ice and focused on other stuff.
>
> I had an idea to implement a state machine, however, the amount of code
> was aggressive and I gave up. Later I implemented a simple version and
> tested it in your case, 32 tags and 2 shared nodes; the results look
> good (see below). However, I'm not confident that this can work well in
> general.
>
> Anyway, I'll send a new RFC version for this, and please let me know if
> you still think this approach is unacceptable.
>
> Thanks,
> Kuai
>
> Test script:
>
> [global]
> ioengine=libaio
> iodepth=2
> bs=4k
> direct=1
> rw=randrw
> group_reporting
>
> [sda]
> numjobs=32
> filename=/dev/sda
>
> [sdb]
> numjobs=1
> filename=/dev/sdb
>
> Test result, by monitoring the new debugfs entry shared_tag_info:
>
> time	active		available
> 	sda	sdb	sda	sdb
> 0	0	0	32	32
> 1	16	2	16	16	-> start fair sharing
> 2	19	2	20	16
> 3	24	2	24	16
> 4	26	2	28	16	-> borrow 32/8=4 tags each round
> 5	28	2	28	16	-> save at least 4 tags for sdb

Hi Yu,

Thank you for having shared these results. What is the unit of the
numbers in the time column?

In the above I see that more tags are assigned to sda than to sdb
although I/O is being submitted to both LUNs. I think the current
algorithm defines fairness as dividing tags in a fair way across active
LUNs. Do the above results show that tags are divided per active job
instead of per active LUN? If so, I'm not sure that everyone will agree
that this is a fair way to distribute tags ...

Thanks,

Bart.
Hi,

On 2023/10/22 00:21, Bart Van Assche wrote:
>
> On 10/21/23 00:32, Yu Kuai wrote:
>> Sorry for such a huge delay. I struggled with implementing a smooth
>> algorithm to borrow tags and return borrowed tags, and later I put this
>> on ice and focused on other stuff.
>>
>> I had an idea to implement a state machine, however, the amount of code
>> was aggressive and I gave up. Later I implemented a simple version and
>> tested it in your case, 32 tags and 2 shared nodes; the results look
>> good (see below). However, I'm not confident that this can work well in
>> general.
>>
>> Anyway, I'll send a new RFC version for this, and please let me know if
>> you still think this approach is unacceptable.
>>
>> Thanks,
>> Kuai
>>
>> Test script:
>>
>> [global]
>> ioengine=libaio
>> iodepth=2
>> bs=4k
>> direct=1
>> rw=randrw
>> group_reporting
>>
>> [sda]
>> numjobs=32
>> filename=/dev/sda
>>
>> [sdb]
>> numjobs=1
>> filename=/dev/sdb
>>
>> Test result, by monitoring the new debugfs entry shared_tag_info:
>>
>> time	active		available
>> 	sda	sdb	sda	sdb
>> 0	0	0	32	32
>> 1	16	2	16	16	-> start fair sharing
>> 2	19	2	20	16
>> 3	24	2	24	16
>> 4	26	2	28	16	-> borrow 32/8=4 tags each round
>> 5	28	2	28	16	-> save at least 4 tags for sdb
>
> Hi Yu,
>
> Thank you for having shared these results. What is the unit of the
> numbers in the time column?

I added a timer in blk_mq_tags, and the timer function is used to
implement borrowing tags and returning borrowed tags. The timer starts
when one node is busy and expires after HZ (one second), so each step in
the time column is one second.

>
> In the above I see that more tags are assigned to sda than to sdb
> although I/O is being submitted to both LUNs. I think the current
> algorithm defines fairness as dividing tags in a fair way across active
> LUNs. Do the above results show that tags are divided per active job
> instead of per active LUN? If so, I'm not sure that everyone will agree
> that this is a fair way to distribute tags ...

Yes, active tags are divided per active LUN, specifically per
request_queue or hctx that is sharing the tags.

Thanks,
Kuai
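The numbers in the quoted table can be reproduced with a toy model of
the described borrowing scheme. This is a guess at the algorithm
inferred from the posted parameters (32 shared tags, a one-second timer,
32/8 = 4 tags borrowed per round, a floor of 4 tags kept for the lightly
loaded queue), not the RFC implementation itself:

#include <stdio.h>

/*
 * Toy userspace model of the tag-borrowing behaviour described above.
 * The parameters come from the thread; the stepping logic is an
 * assumption, not the actual RFC code.
 */
int main(void)
{
	const unsigned int depth = 32;		/* shared tags */
	const unsigned int step = depth / 8;	/* tags borrowed per round */
	const unsigned int floor = 4;		/* reserved for the idle queue */
	const unsigned int sdb_limit = depth / 2;
	unsigned int sda_limit = depth / 2;	/* fair share once sharing starts */
	unsigned int t;

	for (t = 1; t <= 5; t++) {
		printf("t=%us  sda available=%u  sdb available=%u\n",
		       t, sda_limit, sdb_limit);
		/* The busy queue borrows unused tags, down to the floor. */
		if (sda_limit + step <= depth - floor)
			sda_limit += step;
	}
	return 0;
}

Run as-is, this prints 16, 20, 24, 28, 28 for sda while sdb stays at 16,
matching the "available" columns in the posted table.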
On Sat, Oct 21, 2023 at 09:13:38AM -0700, Bart Van Assche wrote:
> On 10/20/23 18:31, Ming Lei wrote:
> > If two LUNs are attached to the same host, one slow and another fast,
> > the slow LUN can easily slow down the fast LUN without this fairness
> > algorithm.
> >
> > Your motivation is that "One of these logical units (WLUN) is used
> > to submit control commands, e.g. START STOP UNIT. If any request is
> > submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> > lower for data LUNs." I guess one simple fix is to not account the
> > queues of this non-IO LUN as active queues?
>
> Hi Ming,
>
> For fast storage devices (e.g. UFS) any time spent in an algorithm for
> fair sharing will reduce IOPS. If there are big differences in the
> request processing latency between different request queues then fair
> sharing is beneficial. Whether or not the fair sharing algorithm is
> improved, how about making it easy to disable fair sharing, e.g. with
> something like the untested patch below? I think that will work better
> than ignoring fair sharing per LUN. UFS devices support multiple logical
> units and with the current fair sharing approach it takes a long time
> before tags are taken away from an inactive LUN (request queue timeout).
>
> Thanks,
>
> Bart.
>
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..b06b161d06de 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>  {
>  	unsigned int depth, users;
>  
> -	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> +	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
> +	    hctx->queue->disable_fair_sharing)

Maybe you can propagate this flag into hctx->flags, then
hctx->queue->disable_fair_sharing can be avoided in the fast path.

>  		return true;
>  
>  	/*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index eef450f25982..63b04cf65887 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -523,6 +523,7 @@ struct request_queue {
>  	struct mutex		debugfs_mutex;
>  
>  	bool			mq_sysfs_init_done;
> +	bool			disable_fair_sharing;

You also need to bypass blk_mq_tag_busy() & blk_mq_tag_idle() in case of
disable_fair_sharing, which should only be set for non-IO queues, such
as the UFS WLUN, and maybe NVMe connection queues.

Thanks,
Ming
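Ming's first remark — caching the setting in hctx->flags — could look
like the sketch below. The flag bit and helper are invented names used
only to show where the copy would happen; this is not existing kernel
code.

/*
 * Sketch of Ming's refinement (hypothetical names): latch the per-queue
 * setting into hctx->flags when the hardware context is set up, so the
 * hot path tests a cached bit instead of dereferencing hctx->queue.
 */
#define BLK_MQ_F_DISABLE_FAIR_SHARING	(1U << 7)	/* invented bit */

static void blk_mq_hctx_cache_fair_sharing(struct blk_mq_hw_ctx *hctx)
{
	if (hctx->queue->disable_fair_sharing)
		hctx->flags |= BLK_MQ_F_DISABLE_FAIR_SHARING;
}

/*
 * hctx_may_queue() would then start with:
 *
 *	if (hctx->flags & BLK_MQ_F_DISABLE_FAIR_SHARING)
 *		return true;
 */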
Remove the code for fair tag sharing because it significantly hurts
performance for UFS devices. Removing this code is safe because the
legacy block layer worked fine without any equivalent fairness
algorithm.

This algorithm hurts performance for UFS devices because UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.

See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/

Note: it has been attempted to rework this algorithm. See also "[PATCH
RFC 0/7] blk-mq: improve tag fair sharing"
(https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
Given the complexity of that patch series, I do not expect that patch
series to be merged.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-tag.c |  4 ----
 block/blk-mq.c     |  3 ---
 block/blk-mq.h     | 39 ---------------------------------------
 3 files changed, 46 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0b..25334bfcabf8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
-	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-	    !hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
-
 	if (data->shallow_depth)
 		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..502dafa76716 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1760,9 +1760,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
 	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
 		bt = &rq->mq_hctx->tags->breserved_tags;
 		tag_offset = 0;
-	} else {
-		if (!hctx_may_queue(rq->mq_hctx, bt))
-			return false;
 	}
 
 	tag = __sbitmap_queue_get(bt);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..14a22f6d3fdf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -407,45 +407,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
 	}
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-				  struct sbitmap_queue *bt)
-{
-	unsigned int depth, users;
-
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return true;
-
-	/*
-	 * Don't try dividing an ant
-	 */
-	if (bt->sb.depth == 1)
-		return true;
-
-	if (blk_mq_is_shared_tags(hctx->flags)) {
-		struct request_queue *q = hctx->queue;
-
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			return true;
-	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			return true;
-	}
-
-	users = READ_ONCE(hctx->tags->active_queues);
-	if (!users)
-		return true;
-
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\