[1/9] blk-mq: allow hw queues to share hostwide tags

Message ID 20190531022801.10003-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq/scsi: convert private reply queue into blk_mq hw queue

Commit Message

Ming Lei May 31, 2019, 2:27 a.m. UTC
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3, ...) support
multiple reply queues with a single set of hostwide tags. The reply queues
are used for request delivery & completion, and one MSI-X vector is
assigned to each reply queue.

These drivers have now switched to pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment. Since only a single blk-mq hw queue is
exposed, each driver has to set up a private reply queue mapping to figure
out which reply queue a request should be delivered to. That mapping is
based on managed IRQ affinity and is generic, so it should be done inside
blk-mq.

Based on the following patch from Hannes, introduce BLK_MQ_F_HOST_TAGS for
converting reply queues into blk-mq hw queues.

	https://marc.info/?l=linux-block&m=149132580511346&w=2

Once a driver sets BLK_MQ_F_HOST_TAGS, the hostwide tags & request pool are
shared among all blk-mq hw queues.

The following patches map each driver's reply queues onto blk-mq hw queues
by applying BLK_MQ_F_HOST_TAGS.

Compared with the current single hw queue implementation, performance
shouldn't be affected by this patch in theory.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  8 ++++++++
 block/blk-mq-tag.c     |  6 ++++++
 block/blk-mq.c         | 14 ++++++++++++++
 block/elevator.c       |  5 +++--
 include/linux/blk-mq.h |  1 +
 6 files changed, 33 insertions(+), 2 deletions(-)
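
For illustration only (this sketch is not part of the posted patch): under this
series, a driver that today exposes a single blk-mq hw queue plus a private
reply-queue mapping could instead register one hw queue per reply queue and set
BLK_MQ_F_HOST_TAGS. The names example_mq_ops, hba, nr_reply_queues and
can_queue are made up for the example.

	/* Minimal sketch of a driver opting in to hostwide tags. */
	static const struct blk_mq_ops example_mq_ops;	/* hypothetical; real driver ops go here */

	static int example_init_tag_set(struct blk_mq_tag_set *set, void *hba,
					unsigned int nr_reply_queues,
					unsigned int can_queue)
	{
		memset(set, 0, sizeof(*set));
		set->ops		= &example_mq_ops;
		set->nr_hw_queues	= nr_reply_queues;	/* one hctx per reply queue */
		set->queue_depth	= can_queue;		/* hostwide depth, shared by all hctxs */
		set->numa_node		= NUMA_NO_NODE;
		set->flags		= BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_HOST_TAGS;
		set->driver_data	= hba;

		return blk_mq_alloc_tag_set(set);
	}

With BLK_MQ_F_HOST_TAGS set, the tag map and request pool are allocated only
for hctx 0 and reused by the other hw queues (see the blk-mq.c hunk below), so
the hostwide tag space is preserved.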

Comments

Hannes Reinecke May 31, 2019, 6:07 a.m. UTC | #1
On 5/31/19 4:27 AM, Ming Lei wrote:
> [...]

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Bart Van Assche May 31, 2019, 3:37 p.m. UTC | #2
On 5/30/19 7:27 PM, Ming Lei wrote:
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 6aea0ebc3a73..3d6780504dcb 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -237,6 +237,7 @@ static const char *const alloc_policy_name[] = {
>   static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(SHOULD_MERGE),
>   	HCTX_FLAG_NAME(TAG_SHARED),
> +	HCTX_FLAG_NAME(HOST_TAGS),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
>   };

The name BLK_MQ_F_HOST_TAGS suggests that tags are shared across a SCSI
host. That is misleading, since this flag means that tags are shared
across hardware queues. Additionally, "host" is a term that comes from
the SCSI world, and this is a block layer patch. That makes me wonder
whether another name should be used to reflect that all hardware queues
share the same tag set. How about renaming BLK_MQ_F_TAG_SHARED to
BLK_MQ_F_TAG_QUEUE_SHARED and BLK_MQ_F_HOST_TAGS to
BLK_MQ_F_TAG_HCTX_SHARED?

Bart.
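
Purely to illustrate the renaming Bart suggests (this is not a hunk from the
series; the bit values simply follow the posted patch), the flags in
include/linux/blk-mq.h would then read roughly:

	enum {
		BLK_MQ_F_SHOULD_MERGE		= 1 << 0,
		BLK_MQ_F_TAG_QUEUE_SHARED	= 1 << 1,	/* was BLK_MQ_F_TAG_SHARED */
		BLK_MQ_F_TAG_HCTX_SHARED	= 1 << 2,	/* was BLK_MQ_F_HOST_TAGS */
		BLK_MQ_F_BLOCKING		= 1 << 5,
		BLK_MQ_F_NO_SCHED		= 1 << 6,
	};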
John Garry June 5, 2019, 2:10 p.m. UTC | #3
On 31/05/2019 03:27, Ming Lei wrote:
> index 32b8ad3d341b..49d73d979cb3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2433,6 +2433,11 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>  {
>  	int ret = 0;
>

Hi Ming,

> +	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx) {
> +		set->tags[hctx_idx] = set->tags[0];

Here we set all tags to the same map as that of hctx index 0.

> +		return true;


As such, I think that the error handling in __blk_mq_alloc_rq_maps() is 
made a little fragile:

__blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
{
	int i;

	for (i = 0; i < set->nr_hw_queues; i++)
		if (!__blk_mq_alloc_rq_map(set, i))
			goto out_unwind;

	return 0;

out_unwind:
	while (--i >= 0)
		blk_mq_free_rq_map(set->tags[i]);

	return -ENOMEM;
}

If __blk_mq_alloc_rq_map(, i > 1) fails when the BLK_MQ_F_HOST_TAGS flag
is set (even though today it can't), then we would try to free
set->tags[0] multiple times.

> +	}
> +
>  	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
>  					set->queue_depth, set->reserved_tags);

Thanks,
John

Ming Lei June 24, 2019, 8:44 a.m. UTC | #4
On Fri, May 31, 2019 at 08:37:39AM -0700, Bart Van Assche wrote:
> On 5/30/19 7:27 PM, Ming Lei wrote:
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 6aea0ebc3a73..3d6780504dcb 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -237,6 +237,7 @@ static const char *const alloc_policy_name[] = {
> >   static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(SHOULD_MERGE),
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> > +	HCTX_FLAG_NAME(HOST_TAGS),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> >   };
> 
> The name BLK_MQ_F_HOST_TAGS suggests that tags are shared across a SCSI
> host. That is misleading since this flag means that tags are shared across
> hardware queues. Additionally, the "host" term is a term that comes from the
> SCSI world and this patch is a block layer patch. That makes me wonder
> whether another name should be used to reflect that all hardware queues
> share the same tag set? How about renaming BLK_MQ_F_TAG_SHARED into
> BLK_MQ_F_TAG_QUEUE_SHARED and renaming BLK_MQ_F_HOST_TAGS into
> BLK_MQ_F_TAG_HCTX_SHARED?

Looks fine, will do it in V2.

Thanks,
Ming
Ming Lei June 24, 2019, 8:46 a.m. UTC | #5
On Wed, Jun 05, 2019 at 03:10:51PM +0100, John Garry wrote:
> On 31/05/2019 03:27, Ming Lei wrote:
> > index 32b8ad3d341b..49d73d979cb3 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2433,6 +2433,11 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
> >  {
> >  	int ret = 0;
> > 
> 
> Hi Ming,
> 
> > +	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx) {
> > +		set->tags[hctx_idx] = set->tags[0];
> 
> Here we set all tags same as that of hctx index 0.
> 
> > +		return true;
> 
> 
> As such, I think that the error handling in __blk_mq_alloc_rq_maps() is made
> a little fragile:
> 
> __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
> {
> 	int i;
> 
> 	for (i = 0; i < set->nr_hw_queues; i++)
> 		if (!__blk_mq_alloc_rq_map(set, i))
> 			goto out_unwind;
> 
> 	return 0;
> 
> out_unwind:
> 	while (--i >= 0)
> 		blk_mq_free_rq_map(set->tags[i]);
> 
> 	return -ENOMEM;
> }
> 
> If __blk_mq_alloc_rq_map(, i > 1) fails for when BLK_MQ_F_HOST_TAGS FLAG is
> set (even though today it can't), then we would try to free set->tags[0]
> multiple times.

Good catch, and the issue can be addressed easily by setting set->tags[i] to
NULL, then checking 'tags' in blk_mq_free_rq_map().

Thanks,
Ming
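
For reference, here is a rough sketch of that idea applied to the helper John
quoted above (not a posted patch; the point is only that the unwind path must
not free the borrowed set->tags[0] again):

	static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
	{
		int i;

		for (i = 0; i < set->nr_hw_queues; i++)
			if (!__blk_mq_alloc_rq_map(set, i))
				goto out_unwind;

		return 0;

	out_unwind:
		while (--i >= 0) {
			/* hctx_idx > 0 only borrows set->tags[0] under BLK_MQ_F_HOST_TAGS */
			if ((set->flags & BLK_MQ_F_HOST_TAGS) && i) {
				set->tags[i] = NULL;
				continue;
			}
			blk_mq_free_rq_map(set->tags[i]);
		}

		return -ENOMEM;
	}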
John Garry June 24, 2019, 1:14 p.m. UTC | #6
On 24/06/2019 09:46, Ming Lei wrote:
> On Wed, Jun 05, 2019 at 03:10:51PM +0100, John Garry wrote:
>> [...]
>

Hi Ming,

> Good catch, and the issue can be addressed easily by setting set->hctx[i] as
> NULL, then check 'tags' in blk_mq_free_rq_map().

OK, so you could do that. But I just think that it's not a great 
practice in general to have multiple pointers pointing at the same 
dynamic memory.

Thanks,
John


Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6aea0ebc3a73..3d6780504dcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -237,6 +237,7 @@  static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
 	HCTX_FLAG_NAME(TAG_SHARED),
+	HCTX_FLAG_NAME(HOST_TAGS),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74c6bb871f7e..3a4d9ad63e7b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -449,6 +449,9 @@  static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
 {
+	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx)
+		return;
+
 	if (hctx->sched_tags) {
 		blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
 		blk_mq_free_rq_map(hctx->sched_tags);
@@ -463,6 +466,11 @@  static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	struct blk_mq_tag_set *set = q->tag_set;
 	int ret;
 
+	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx) {
+		hctx->sched_tags = q->queue_hw_ctx[0]->sched_tags;
+		return 0;
+	}
+
 	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
 					       set->reserved_tags);
 	if (!hctx->sched_tags)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7513c8eaabee..309ec5079f3f 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,6 +358,9 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+
+		if (tagset->flags & BLK_MQ_F_HOST_TAGS)
+			break;
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -405,6 +408,9 @@  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+
+		if (hctx->flags & BLK_MQ_F_HOST_TAGS)
+			break;
 	}
 	blk_queue_exit(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32b8ad3d341b..49d73d979cb3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2433,6 +2433,11 @@  static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
 {
 	int ret = 0;
 
+	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx) {
+		set->tags[hctx_idx] = set->tags[0];
+		return true;
+	}
+
 	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
 					set->queue_depth, set->reserved_tags);
 	if (!set->tags[hctx_idx])
@@ -2451,6 +2456,9 @@  static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
 static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 					 unsigned int hctx_idx)
 {
+	if ((set->flags & BLK_MQ_F_HOST_TAGS) && hctx_idx)
+		return;
+
 	if (set->tags && set->tags[hctx_idx]) {
 		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
 		blk_mq_free_rq_map(set->tags[hctx_idx]);
@@ -3166,6 +3174,12 @@  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		}
 		if (ret)
 			break;
+
+		if (set->flags & BLK_MQ_F_HOST_TAGS)
+			break;
+	}
+
+	queue_for_each_hw_ctx(q, hctx, i) {
 		if (q->elevator && q->elevator->type->ops.depth_updated)
 			q->elevator->type->ops.depth_updated(hctx);
 	}
diff --git a/block/elevator.c b/block/elevator.c
index ec55d5fc0b3e..ed553d9bc53e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -596,7 +596,8 @@  int elevator_switch_mq(struct request_queue *q,
 
 /*
  * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices.  If deadline isn't available OR we have multiple queues,
+ * queue devices or multiple queue devices with hostwide tags.  If deadline isn't
+ * available OR we have multiple queues,
  * default to "none".
  */
 int elevator_init_mq(struct request_queue *q)
@@ -604,7 +605,7 @@  int elevator_init_mq(struct request_queue *q)
 	struct elevator_type *e;
 	int err = 0;
 
-	if (q->nr_hw_queues != 1)
+	if (q->nr_hw_queues != 1 && !(q->tag_set->flags & BLK_MQ_F_HOST_TAGS))
 		return 0;
 
 	/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa53d96c..b4e33b509229 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -219,6 +219,7 @@  struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_HOST_TAGS	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,