[3/4] blk-mq: use hw tag for scheduling if hw tag space is big enough

Message ID 20170428151539.25514-4-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei April 28, 2017, 3:15 p.m. UTC
When tag space of one device is big enough, we use hw tag
directly for I/O scheduling.

Now the decision is made if hw queue depth is not less than
q->nr_requests and the tag set isn't shared.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  8 ++++++++
 block/blk-mq-sched.h | 15 +++++++++++++++
 block/blk-mq.c       | 18 +++++++++++++++++-
 3 files changed, 40 insertions(+), 1 deletion(-)

Comments

Bart Van Assche April 28, 2017, 6:09 p.m. UTC | #1
On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> +{
> +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> +		return false;
> +
> +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> +		return false;
> +
> +	return true;
> +}

The only user of shared tag sets I know of is scsi-mq. I think it's really
unfortunate that this patch systematically disables BLK_MQ_F_SCHED_USE_HW_TAG
for scsi-mq.

>  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
> @@ -2681,9 +2694,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  			break;
>  	}
>  
> -	if (!ret)
> +	if (!ret) {
>  		q->nr_requests = nr;
>  
> +		blk_mq_update_sched_flag(q);
> +	}
> +
>  	blk_mq_unfreeze_queue(q);
>  	blk_mq_start_stopped_hw_queues(q, true);

If a queue is created with a low value of nr_requests that will cause
blk_mq_sched_alloc_tags() to skip allocation of .sched_tags. If nr_requests
is increased, can that cause this function to clear BLK_MQ_F_SCHED_USE_HW_TAG
while keeping hctx->sched_tags == NULL and hence trigger a NULL pointer
dereference?

Bart.
Jens Axboe April 28, 2017, 6:22 p.m. UTC | #2
On 04/28/2017 09:15 AM, Ming Lei wrote:  
> +/*
> + * If this queue has enough hardware tags and doesn't share tags with
> + * other queues, just use hw tag directly for scheduling.
> + */
> +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> +{
> +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> +		return false;
> +
> +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> +		return false;

I think we should leave a bigger gap. Ideally, for scheduling, we should
have a hw queue depth that's around half of what the scheduler has to
work with. That will always leave us something to schedule with, if the
hw can't deplete the whole pool.
Bart Van Assche April 28, 2017, 8:11 p.m. UTC | #3
On Fri, 2017-04-28 at 12:22 -0600, Jens Axboe wrote:
> On 04/28/2017 09:15 AM, Ming Lei wrote:  
> > +/*
> > + * If this queue has enough hardware tags and doesn't share tags with
> > + * other queues, just use hw tag directly for scheduling.
> > + */
> > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > +{
> > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > +		return false;
> > +
> > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > +		return false;
> 
> I think we should leave a bigger gap. Ideally, for scheduling, we should
> have a hw queue depth that's around half of what the scheduler has to
> work with. That will always leave us something to schedule with, if the
> hw can't deplete the whole pool.

Hello Jens,

The scsi-mq core allocates exactly the same number of tags per hardware
queue as the SCSI queue depth. Requiring that there is a gap would cause
BLK_MQ_F_SCHED_USE_HW_TAG not to be enabled for any scsi-mq LLD. I'm not
sure that changing the tag allocation strategy in scsi-mq would be the best
solution. How about changing blk_mq_sched_may_use_hw_tag() into something
like the below to guarantee that the scheduler has sufficient tags available?

static bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
{
        return blk_mq_get_queue_depth(q) >= max(q->nr_requests, 16);
}

Thanks,

Bart.
Ming Lei April 29, 2017, 10:35 a.m. UTC | #4
On Fri, Apr 28, 2017 at 06:09:40PM +0000, Bart Van Assche wrote:
> On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > +{
> > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > +		return false;
> > +
> > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > +		return false;
> > +
> > +	return true;
> > +}
> 
> The only user of shared tag sets I know of is scsi-mq. I think it's really
> unfortunate that this patch systematically disables BLK_MQ_F_SCHED_USE_HW_TAG
> for scsi-mq.

In a previous version of the patch I actually allowed the driver to pass
this flag, but that feature was dropped in this post, just to keep it
simple and clean. If you think we need it for shared tag sets, I can add
it back in v1.

For shared tag sets, I suggest not enabling it by default, because the
scheduler is per request queue now, and generally the more requests it
has available, the better it performs. When tags are shared among several
request queues, one of them may use up the tags for its own scheduling
and starve the others. But it should be possible, and not difficult, to
allocate requests fairly for scheduling in this case if we switch to
per-hctx scheduling.

> 
> >  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >  {
> >  	struct blk_mq_tag_set *set = q->tag_set;
> > @@ -2681,9 +2694,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >  			break;
> >  	}
> >  
> > -	if (!ret)
> > +	if (!ret) {
> >  		q->nr_requests = nr;
> >  
> > +		blk_mq_update_sched_flag(q);
> > +	}
> > +
> >  	blk_mq_unfreeze_queue(q);
> >  	blk_mq_start_stopped_hw_queues(q, true);
> 
> If a queue is created with a low value of nr_requests that will cause
> blk_mq_sched_alloc_tags() to skip allocation of .sched_tags. If nr_requests
> is increased, can that cause this function to clear BLK_MQ_F_SCHED_USE_HW_TAG
> while keeping hctx->sched_tags == NULL and hence trigger a NULL pointer
> dereference?

Good catch, I will fix it in V1.

Given that both switching the scheduler and updating the number of
requests are infrequent actions, we can allocate/deallocate .sched_tags
only when it is needed.


Thanks,
Ming
Ming Lei April 29, 2017, 10:59 a.m. UTC | #5
On Fri, Apr 28, 2017 at 12:22:45PM -0600, Jens Axboe wrote:
> On 04/28/2017 09:15 AM, Ming Lei wrote:  
> > +/*
> > + * If this queue has enough hardware tags and doesn't share tags with
> > + * other queues, just use hw tag directly for scheduling.
> > + */
> > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > +{
> > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > +		return false;
> > +
> > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > +		return false;
> 
> I think we should leave a bigger gap. Ideally, for scheduling, we should
> have a hw queue depth that's around half of what the scheduler has to
> work with. That will always leave us something to schedule with, if the
> hw can't deplete the whole pool.

When .sched_tags and .tags are different, it makes sense to make
nr_requests two times the queue depth.

When we switch to scheduling with hw tags directly, that equation can't
hold at all. The simple policy in this patch can't be worse than a
standalone .sched_tags, because the lifetime of one sched tag is exactly
that of its request (from allocation at I/O submission to freeing at I/O
completion). And with a bigger queue depth, we can schedule even more.


Thanks,
Ming
Bart Van Assche May 1, 2017, 3:06 p.m. UTC | #6
On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> On Fri, Apr 28, 2017 at 06:09:40PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > +{
> > > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > +		return false;
> > > +
> > > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > +		return false;
> > > +
> > > +	return true;
> > > +}
> > 
> > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > unfortunate that this patch systematically disables BLK_MQ_F_SCHED_USE_HW_TAG
> > for scsi-mq.
> 
> In previous patch, I actually allow driver to pass this flag, but this
> feature is dropped in this post, just for making it simple & clean.
> If you think we need it for shared tag set, I can add it in v1.
> 
> For shared tag sets, I suggest to not enable it at default, because
> scheduler is per request queue now, and generaly more requests available,
> better it performs.  When tags are shared among several request
> queues, one of them may use tags up for its own scheduling, then
> starve others. But it should be possible and not difficult to allocate
> requests fairly for scheduling in this case if we switch to per-hctx
> scheduling.

Hello Ming,

Have you noticed that there is already a mechanism in the block layer to
avoid starvation if a tag set is shared? The hctx_may_queue() function
guarantees that each user that shares a tag set gets at least some tags.
The .active_queues counter keeps track of the number of hardware queues
that share a tag set.

Bart.
Omar Sandoval May 2, 2017, 3:49 a.m. UTC | #7
On Mon, May 01, 2017 at 03:06:16PM +0000, Bart Van Assche wrote:
> On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> > On Fri, Apr 28, 2017 at 06:09:40PM +0000, Bart Van Assche wrote:
> > > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > > +{
> > > > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > > +		return false;
> > > > +
> > > > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > > +		return false;
> > > > +
> > > > +	return true;
> > > > +}
> > > 
> > > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > > unfortunate that this patch systematically disables BLK_MQ_F_SCHED_USE_HW_TAG
> > > for scsi-mq.
> > 
> > In previous patch, I actually allow driver to pass this flag, but this
> > feature is dropped in this post, just for making it simple & clean.
> > If you think we need it for shared tag set, I can add it in v1.
> > 
> > For shared tag sets, I suggest to not enable it at default, because
> > scheduler is per request queue now, and generaly more requests available,
> > better it performs.  When tags are shared among several request
> > queues, one of them may use tags up for its own scheduling, then
> > starve others. But it should be possible and not difficult to allocate
> > requests fairly for scheduling in this case if we switch to per-hctx
> > scheduling.
> 
> Hello Ming,
> 
> Have you noticed that there is already a mechanism in the block layer to
> avoid starvation if a tag set is shared? The hctx_may_queue() function
> guarantees that each user that shares a tag set gets at least some tags.
> The .active_queues counter keeps track of the number of hardware queues
> that share a tag set.
> 
> Bart.

The scheduler tags are there to abstract away the hardware, and
USE_HW_TAG should just be an optimization for when that abstraction is a
no-op. That's not the case when there are shared tags, and I doubt that
the overhead of the scheduler tags is significant for scsi-mq. Let's
stick with the behavior Ming had here.
Ming Lei May 2, 2017, 8:46 a.m. UTC | #8
On Mon, May 01, 2017 at 03:06:16PM +0000, Bart Van Assche wrote:
> On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> > On Fri, Apr 28, 2017 at 06:09:40PM +0000, Bart Van Assche wrote:
> > > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > > +{
> > > > +	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > > +		return false;
> > > > +
> > > > +	if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > > +		return false;
> > > > +
> > > > +	return true;
> > > > +}
> > > 
> > > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > > unfortunate that this patch systematically disables BLK_MQ_F_SCHED_USE_HW_TAG
> > > for scsi-mq.
> > 
> > In previous patch, I actually allow driver to pass this flag, but this
> > feature is dropped in this post, just for making it simple & clean.
> > If you think we need it for shared tag set, I can add it in v1.
> > 
> > For shared tag sets, I suggest to not enable it at default, because
> > scheduler is per request queue now, and generaly more requests available,
> > better it performs.  When tags are shared among several request
> > queues, one of them may use tags up for its own scheduling, then
> > starve others. But it should be possible and not difficult to allocate
> > requests fairly for scheduling in this case if we switch to per-hctx
> > scheduling.
> 
> Hello Ming,
> 
> Have you noticed that there is already a mechanism in the block layer to
> avoid starvation if a tag set is shared? The hctx_may_queue() function
> guarantees that each user that shares a tag set gets at least some tags.
> The .active_queues counter keeps track of the number of hardware queues
> that share a tag set.

OK. From hctx_may_queue(), each hw queue can be allocated at most
(queue depth / nr_hw_queues) tags, so we could allow hw tags to be used
for the scheduler too if the following condition is true for shared
tags:

	q->nr_requests <= (blk_mq_get_queue_depth(q) / nr_hw_queues)

That should often be true for some SCSI devices (such as virtio-scsi)
in some configurations.


thanks,
Ming
Omar Sandoval May 3, 2017, 4:29 p.m. UTC | #9
On Fri, Apr 28, 2017 at 11:15:38PM +0800, Ming Lei wrote:
> When tag space of one device is big enough, we use hw tag
> directly for I/O scheduling.
> 
> Now the decision is made if hw queue depth is not less than
> q->nr_requests and the tag set isn't shared.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c |  8 ++++++++
>  block/blk-mq-sched.h | 15 +++++++++++++++
>  block/blk-mq.c       | 18 +++++++++++++++++-
>  3 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 45a675f07b8b..4681e27c127e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -507,6 +507,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>  	struct elevator_queue *eq;
>  	unsigned int i;
>  	int ret;
> +	bool auto_hw_tag;
>  
>  	if (!e) {
>  		q->elevator = NULL;
> @@ -519,7 +520,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>  	 */
>  	q->nr_requests = 2 * BLKDEV_MAX_RQ;
>  
> +	auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
> +
>  	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (auto_hw_tag)
> +			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
> +		else
> +			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
> +
>  		ret = blk_mq_sched_alloc_tags(q, hctx, i);
>  		if (ret)
>  			goto err;

I think you should also clear the BLK_MQ_F_SCHED_USE_HW_TAG flag in
blk_mq_exit_sched()?
Ming Lei May 3, 2017, 4:55 p.m. UTC | #10
On Wed, May 03, 2017 at 09:29:36AM -0700, Omar Sandoval wrote:
> On Fri, Apr 28, 2017 at 11:15:38PM +0800, Ming Lei wrote:
> > When tag space of one device is big enough, we use hw tag
> > directly for I/O scheduling.
> > 
> > Now the decision is made if hw queue depth is not less than
> > q->nr_requests and the tag set isn't shared.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-sched.c |  8 ++++++++
> >  block/blk-mq-sched.h | 15 +++++++++++++++
> >  block/blk-mq.c       | 18 +++++++++++++++++-
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 45a675f07b8b..4681e27c127e 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -507,6 +507,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> >  	struct elevator_queue *eq;
> >  	unsigned int i;
> >  	int ret;
> > +	bool auto_hw_tag;
> >  
> >  	if (!e) {
> >  		q->elevator = NULL;
> > @@ -519,7 +520,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> >  	 */
> >  	q->nr_requests = 2 * BLKDEV_MAX_RQ;
> >  
> > +	auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
> > +
> >  	queue_for_each_hw_ctx(q, hctx, i) {
> > +		if (auto_hw_tag)
> > +			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
> > +		else
> > +			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
> > +
> >  		ret = blk_mq_sched_alloc_tags(q, hctx, i);
> >  		if (ret)
> >  			goto err;
> 
> I think you should also clear the BLK_MQ_F_SCHED_USE_HW_TAG flag in
> blk_mq_exit_sched()?

That doesn't look necessary, since the flag is always evaluated again in
blk_mq_init_sched().

Thanks,
Ming
Omar Sandoval May 3, 2017, 5 p.m. UTC | #11
On Thu, May 04, 2017 at 12:55:30AM +0800, Ming Lei wrote:
> On Wed, May 03, 2017 at 09:29:36AM -0700, Omar Sandoval wrote:
> > On Fri, Apr 28, 2017 at 11:15:38PM +0800, Ming Lei wrote:
> > > When tag space of one device is big enough, we use hw tag
> > > directly for I/O scheduling.
> > > 
> > > Now the decision is made if hw queue depth is not less than
> > > q->nr_requests and the tag set isn't shared.
> > > 
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  block/blk-mq-sched.c |  8 ++++++++
> > >  block/blk-mq-sched.h | 15 +++++++++++++++
> > >  block/blk-mq.c       | 18 +++++++++++++++++-
> > >  3 files changed, 40 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 45a675f07b8b..4681e27c127e 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -507,6 +507,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> > >  	struct elevator_queue *eq;
> > >  	unsigned int i;
> > >  	int ret;
> > > +	bool auto_hw_tag;
> > >  
> > >  	if (!e) {
> > >  		q->elevator = NULL;
> > > @@ -519,7 +520,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> > >  	 */
> > >  	q->nr_requests = 2 * BLKDEV_MAX_RQ;
> > >  
> > > +	auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
> > > +
> > >  	queue_for_each_hw_ctx(q, hctx, i) {
> > > +		if (auto_hw_tag)
> > > +			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
> > > +		else
> > > +			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
> > > +
> > >  		ret = blk_mq_sched_alloc_tags(q, hctx, i);
> > >  		if (ret)
> > >  			goto err;
> > 
> > I think you should also clear the BLK_MQ_F_SCHED_USE_HW_TAG flag in
> > blk_mq_exit_sched()?
> 
> Looks not necessary since the flag is always evaluated in
> blk_mq_init_sched().

What if we're setting the scheduler to "none"? Then blk_mq_init_sched()
will go in here:

if (!e) {
	q->elevator = NULL;
	return 0;
}
Ming Lei May 3, 2017, 5:33 p.m. UTC | #12
On Wed, May 03, 2017 at 10:00:04AM -0700, Omar Sandoval wrote:
> On Thu, May 04, 2017 at 12:55:30AM +0800, Ming Lei wrote:
> > On Wed, May 03, 2017 at 09:29:36AM -0700, Omar Sandoval wrote:
> > > On Fri, Apr 28, 2017 at 11:15:38PM +0800, Ming Lei wrote:
> > > > When tag space of one device is big enough, we use hw tag
> > > > directly for I/O scheduling.
> > > > 
> > > > Now the decision is made if hw queue depth is not less than
> > > > q->nr_requests and the tag set isn't shared.
> > > > 
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  block/blk-mq-sched.c |  8 ++++++++
> > > >  block/blk-mq-sched.h | 15 +++++++++++++++
> > > >  block/blk-mq.c       | 18 +++++++++++++++++-
> > > >  3 files changed, 40 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index 45a675f07b8b..4681e27c127e 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -507,6 +507,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> > > >  	struct elevator_queue *eq;
> > > >  	unsigned int i;
> > > >  	int ret;
> > > > +	bool auto_hw_tag;
> > > >  
> > > >  	if (!e) {
> > > >  		q->elevator = NULL;
> > > > @@ -519,7 +520,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> > > >  	 */
> > > >  	q->nr_requests = 2 * BLKDEV_MAX_RQ;
> > > >  
> > > > +	auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
> > > > +
> > > >  	queue_for_each_hw_ctx(q, hctx, i) {
> > > > +		if (auto_hw_tag)
> > > > +			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
> > > > +		else
> > > > +			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
> > > > +
> > > >  		ret = blk_mq_sched_alloc_tags(q, hctx, i);
> > > >  		if (ret)
> > > >  			goto err;
> > > 
> > > I think you should also clear the BLK_MQ_F_SCHED_USE_HW_TAG flag in
> > > blk_mq_exit_sched()?
> > 
> > Looks not necessary since the flag is always evaluated in
> > blk_mq_init_sched().
> 
> What if we're setting the scheduler to "none"? Then blk_mq_init_sched()
> will go in here:
> 
> if (!e) {
> 	q->elevator = NULL;
> 	return 0;
> }

Good catch, I will clear the flag in blk_mq_exit_sched() in V2.

Thanks,
Ming

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 45a675f07b8b..4681e27c127e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -507,6 +507,7 @@  int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
+	bool auto_hw_tag;
 
 	if (!e) {
 		q->elevator = NULL;
@@ -519,7 +520,14 @@  int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
+	auto_hw_tag = blk_mq_sched_may_use_hw_tag(q);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
+		if (auto_hw_tag)
+			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
+		else
+			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
+
 		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
 			goto err;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index edafb5383b7b..22a19c118044 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -129,4 +129,19 @@  static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+/*
+ * If this queue has enough hardware tags and doesn't share tags with
+ * other queues, just use hw tag directly for scheduling.
+ */
+static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
+{
+	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
+		return false;
+
+	if (blk_mq_get_queue_depth(q) < q->nr_requests)
+		return false;
+
+	return true;
+}
+
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 04761fb76ab4..b0bd1fb4b0f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2649,6 +2649,19 @@  int blk_mq_get_queue_depth(struct request_queue *q)
 	return tags->bitmap_tags.sb.depth;
 }
 
+static void blk_mq_update_sched_flag(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	int i;
+
+	if (!blk_mq_sched_may_use_hw_tag(q))
+		queue_for_each_hw_ctx(q, hctx, i)
+			hctx->flags &= ~BLK_MQ_F_SCHED_USE_HW_TAG;
+	else
+		queue_for_each_hw_ctx(q, hctx, i)
+			hctx->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
+}
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -2681,9 +2694,12 @@  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			break;
 	}
 
-	if (!ret)
+	if (!ret) {
 		q->nr_requests = nr;
 
+		blk_mq_update_sched_flag(q);
+	}
+
 	blk_mq_unfreeze_queue(q);
 	blk_mq_start_stopped_hw_queues(q, true);