
[v2] block: Improve shared tag set performance

Message ID 20230103195337.158625-1-bvanassche@acm.org (mailing list archive)
State New, archived

Commit Message

Bart Van Assche Jan. 3, 2023, 7:53 p.m. UTC
Remove the code for fair tag sharing because it significantly hurts
performance for UFS devices. Removing this code is safe because the
legacy block layer worked fine without any equivalent fairness
algorithm.

This algorithm hurts performance for UFS devices because UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.

See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
Changes compared to v1: restored the tags->active_queues variable and thereby
  fixed the "uninitialized variable" warning reported by the kernel test robot.

 block/blk-mq-tag.c |  4 ----
 block/blk-mq.c     |  3 ---
 block/blk-mq.h     | 40 ----------------------------------------
 3 files changed, 47 deletions(-)

Comments

Bart Van Assche Jan. 18, 2023, 9:20 p.m. UTC | #1
On 1/3/23 11:53, Bart Van Assche wrote:
> Remove the code for fair tag sharing because it significantly hurts
> performance for UFS devices. Removing this code is safe because the
> legacy block layer worked fine without any equivalent fairness
> algorithm.
> 
> This algorithm hurts performance for UFS devices because UFS devices
> have multiple logical units. One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs.

Can anyone please help with reviewing this patch?

Thanks,

Bart.
Keith Busch Jan. 18, 2023, 10:14 p.m. UTC | #2
On Tue, Jan 03, 2023 at 11:53:37AM -0800, Bart Van Assche wrote:
> Remove the code for fair tag sharing because it significantly hurts
> performance for UFS devices. Removing this code is safe because the
> legacy block layer worked fine without any equivalent fairness
> algorithm.

The legacy block layer didn't have a limited tag resource that could block
request allocations, though, so I don't think we can appeal to that
example to conclude that removing this is safe. As much as I'd like to get
rid of this code, I honestly have no idea whether anyone is relying on it.
 
> This algorithm hurts performance for UFS devices because UFS devices
> have multiple logical units. One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs.

Can you give the WLUN its own tag set instead?
Bart Van Assche Jan. 18, 2023, 10:46 p.m. UTC | #3
On 1/18/23 14:14, Keith Busch wrote:
> On Tue, Jan 03, 2023 at 11:53:37AM -0800, Bart Van Assche wrote:
>> This algorithm hurts performance for UFS devices because UFS devices
>> have multiple logical units. One of these logical units (WLUN) is used
>> to submit control commands, e.g. START STOP UNIT. If any request is
>> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
>> lower for data LUNs.
> 
> Can you give the WLUN its own tag set instead?

Hi Keith,

Does that mean creating an additional tag set for the WLUN and statically 
partitioning the tags between the WLUN tag set and the tag set used by the 
other LUNs? That approach would have the following disadvantages:
- The maximum queue depth for other LUNs would be reduced from 31 to 30.
   This would have a small but noticeable negative performance impact.
- The code removed by this patch negatively impacts performance of all
   SCSI hosts with two or more data LUNs and also of all NVMe controllers
   that export two or more namespaces if there are significant
   differences in the number of I/O operations per second for different
   LUNs/namespaces. This is why I think that this should be solved in the
   block layer instead of modifying each block driver individually.

Thanks,

Bart.
Bart Van Assche Jan. 19, 2023, 5:48 p.m. UTC | #4
On 1/18/23 14:46, Bart Van Assche wrote:
> On 1/18/23 14:14, Keith Busch wrote:
>> On Tue, Jan 03, 2023 at 11:53:37AM -0800, Bart Van Assche wrote:
>>> This algorithm hurts performance for UFS devices because UFS devices
>>> have multiple logical units. One of these logical units (WLUN) is used
>>> to submit control commands, e.g. START STOP UNIT. If any request is
>>> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
>>> lower for data LUNs.
>>
>> Can you give the WLUN its own tag set instead?
> 
> Hi Keith,
> 
> Does that mean creating an additional tag set for the WLUN and statically 
> partitioning the tags between the WLUN tag set and the tag set used by the 
> other LUNs? That approach would have the following disadvantages:
> - The maximum queue depth for other LUNs would be reduced from 31 to 30.
>    This would have a small but noticeable negative performance impact.
> - The code removed by this patch negatively impacts performance of all
>    SCSI hosts with two or more data LUNs and also of all NVMe controllers
>    that export two or more namespaces if there are significant
>    differences in the number of I/O operations per second for different
>    LUNs/namespaces. This is why I think that this should be solved in the
>    block layer instead of modifying each block driver individually.

(replying to my own e-mail)

Hi Keith,

An additional concern about the approach of using multiple tag sets per 
SCSI host: it is not compatible with code that assumes a single tag set 
per host and iterates over the requests of that tag set. Examples of such 
functions are blk_mq_tagset_busy_iter(), scsi_host_busy(), 
scsi_host_complete_all_commands() and scsi_host_busy_iter(). 
scsi_host_busy() is used by e.g. scsi_host_queue_ready() and 
scsi_eh_wakeup(). So I think this approach would break the SCSI core in 
a subtle way.

Bart.
Keith Busch Jan. 19, 2023, 6:47 p.m. UTC | #5
On Wed, Jan 18, 2023 at 02:46:38PM -0800, Bart Van Assche wrote:
> - The code removed by this patch negatively impacts performance of all
>   SCSI hosts with two or more data LUNs and also of all NVMe controllers
>   that export two or more namespaces if there are significant
>   differences in the number of I/O operations per second for different
>   LUNs/namespaces. This is why I think that this should be solved in the
>   block layer instead of modifying each block driver individually.

I think the potential performance sacrifice was accepted on purpose in order
to guarantee that all tag set subscribers have at least one unblocked
path to the pool; otherwise one LUN could starve all the others by
tying up all the tags in long-running operations. Is anyone actually
benefiting from the current sharing, though? I really do not know, so
I'm having trouble weighing in on removing it.
Bart Van Assche Jan. 19, 2023, 7:28 p.m. UTC | #6
On 1/19/23 10:47, Keith Busch wrote:
> I think the potential performance sacrifice was accepted on purpose in order
> to guarantee that all tag set subscribers have at least one unblocked
> path to the pool; otherwise one LUN could starve all the others by
> tying up all the tags in long-running operations. Is anyone actually
> benefiting from the current sharing, though? I really do not know, so
> I'm having trouble weighing in on removing it.

Hi Keith,

Hmm ... does reserving tags for one logical unit / namespace really help 
if long-running operations are active on another logical unit or 
namespace? For storage devices where the storage medium is shared across 
logical units or namespaces, reserving tags won't help if the storage 
controller cannot interrupt ongoing medium accesses. For storage 
devices with multiple "spindles" (e.g. a NAS) the tag fairness algorithm 
may help, but this is not guaranteed.

How about making the tag fairness algorithm configurable (via a request 
queue flag?) and leaving it disabled by default?

Thanks,

Bart.

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9eb968e14d31..20d37d98ccb9 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -98,10 +98,6 @@  void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
-	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-			!hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
-
 	if (data->shallow_depth)
 		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c5cf0dbca1db..d0e07d83c23d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1768,9 +1768,6 @@  static bool __blk_mq_alloc_driver_tag(struct request *rq)
 	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
 		bt = &rq->mq_hctx->tags->breserved_tags;
 		tag_offset = 0;
-	} else {
-		if (!hctx_may_queue(rq->mq_hctx, bt))
-			return false;
 	}
 
 	tag = __sbitmap_queue_get(bt);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef59fee62780..7b27d1d1bf51 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -334,46 +334,6 @@  static inline void blk_mq_free_requests(struct list_head *list)
 	}
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-				  struct sbitmap_queue *bt)
-{
-	unsigned int depth, users;
-
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return true;
-
-	/*
-	 * Don't try dividing an ant
-	 */
-	if (bt->sb.depth == 1)
-		return true;
-
-	if (blk_mq_is_shared_tags(hctx->flags)) {
-		struct request_queue *q = hctx->queue;
-
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			return true;
-	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			return true;
-	}
-
-	users = atomic_read(&hctx->tags->active_queues);
-
-	if (!users)
-		return true;
-
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\