[1/4] blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG

Message ID 20170428151539.25514-2-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei April 28, 2017, 3:15 p.m. UTC
When a blk-mq I/O scheduler is used, we need two tags to submit
one request. One is the scheduler tag, used to allocate the
request and schedule I/O; the other is the driver tag, used to
dispatch the request to the hardware/driver. This scheme adds an
extra per-queue allocation for both the tags and the request
pool, and may not be as efficient as the 'none' scheduler case.

Also, we currently put a default per-hctx limit on schedulable
requests, and this limit may become a bottleneck for some
devices, especially when they have a quite large tag space.

This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that
hardware/driver tags can be used directly for I/O scheduling if
the device's hardware tag space is big enough. Then we can avoid
the extra resource allocation and make I/O submission more
efficient.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 10 +++++++++-
 block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  1 +
 3 files changed, 39 insertions(+), 7 deletions(-)
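
For illustration, a driver with a large hardware tag space could opt in roughly as sketched below. The function name and the depth threshold are assumptions made for this example; only the BLK_MQ_F_SCHED_USE_HW_TAG flag comes from this patch.

/*
 * Illustrative sketch only: how a driver might request scheduling on hw
 * tags when its hardware queue depth is comfortably large.  The helper
 * name and the 256 threshold are assumptions, not part of the patch.
 */
static void example_setup_tag_set(struct blk_mq_tag_set *set,
				  unsigned int hw_queue_depth)
{
	set->queue_depth = hw_queue_depth;
	set->flags = BLK_MQ_F_SHOULD_MERGE;

	/* Only sensible when the hardware tag space is big enough. */
	if (hw_queue_depth >= 256)
		set->flags |= BLK_MQ_F_SCHED_USE_HW_TAG;
}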

Comments

Omar Sandoval May 3, 2017, 4:21 p.m. UTC | #1
On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> When blk-mq I/O scheduler is used, we need two tags for
> submitting one request. One is called scheduler tag for
> allocating request and scheduling I/O, another one is called
> driver tag, which is used for dispatching IO to hardware/driver.
> This way introduces one extra per-queue allocation for both tags
> and request pool, and may not be as efficient as case of none
> scheduler.
> 
> Also currently we put a default per-hctx limit on schedulable
> requests, and this limit may be a bottleneck for some devices,
> especialy when these devices have a quite big tag space.
> 
> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> allow to use hardware/driver tags directly for IO scheduling if
> devices's hardware tag space is big enough. Then we can avoid
> the extra resource allocation and make IO submission more
> efficient.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c   | 10 +++++++++-
>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 39 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 27c67465f856..45a675f07b8b 100644
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0168b27469cb..e530bc54f0d9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -263,9 +263,19 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
>  				rq->rq_flags = RQF_MQ_INFLIGHT;
>  				atomic_inc(&data->hctx->nr_active);
>  			}
> -			rq->tag = tag;
> -			rq->internal_tag = -1;
> -			data->hctx->tags->rqs[rq->tag] = rq;
> +			data->hctx->tags->rqs[tag] = rq;
> +
> +			/*
> +			 * If we use hw tag for scheduling, postpone setting
> +			 * rq->tag in blk_mq_get_driver_tag().
> +			 */
> +			if (data->hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
> +				rq->tag = -1;
> +				rq->internal_tag = tag;
> +			} else {
> +				rq->tag = tag;
> +				rq->internal_tag = -1;
> +			}

I'm guessing you did it this way because we currently check rq->tag to
decide whether this is a flush that needs to be bypassed? Makes sense,
but I'm adding it to my list of reasons why the flush stuff sucks.
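
For readers following along: the bypass decision Omar refers to is keyed off whether the request already owns a driver tag. The sketch below is a paraphrase from memory, not the exact kernel function; it only shows why rq->tag must stay at -1 for ordinary requests even when the tag itself came from the hw tag set.

/*
 * Paraphrased sketch (not the exact kernel code): a request that already
 * holds a driver tag (e.g. a flush) skips the scheduler and goes straight
 * to the hctx dispatch list.
 */
static bool sched_bypass_insert_sketch(struct blk_mq_hw_ctx *hctx,
				       struct request *rq)
{
	if (rq->tag == -1)
		return false;	/* no driver tag yet: go through the scheduler */

	/* already owns a driver tag: dispatch directly */
	spin_lock(&hctx->lock);
	list_add(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);
	return true;
}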

> @@ -893,9 +909,15 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>  				    struct request *rq)
>  {
> -	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> +	unsigned tag = rq->tag;

The pickiest of all nits, but we mostly spell out `unsigned int` in this
file; it'd be nice to stay consistent.
Omar Sandoval May 3, 2017, 4:46 p.m. UTC | #2
On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> When blk-mq I/O scheduler is used, we need two tags for
> submitting one request. One is called scheduler tag for
> allocating request and scheduling I/O, another one is called
> driver tag, which is used for dispatching IO to hardware/driver.
> This way introduces one extra per-queue allocation for both tags
> and request pool, and may not be as efficient as case of none
> scheduler.
> 
> Also currently we put a default per-hctx limit on schedulable
> requests, and this limit may be a bottleneck for some devices,
> especialy when these devices have a quite big tag space.
> 
> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> allow to use hardware/driver tags directly for IO scheduling if
> devices's hardware tag space is big enough. Then we can avoid
> the extra resource allocation and make IO submission more
> efficient.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c   | 10 +++++++++-
>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 39 insertions(+), 7 deletions(-)

One more note on this: if we're using the hardware tags directly, then
we are no longer limited to q->nr_requests requests in-flight. Instead,
we're limited to the hw queue depth. We probably want to maintain the
original behavior, so I think we need to resize the hw tags in
blk_mq_init_sched() if we're using hardware tags.
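
A minimal sketch of that suggestion, assuming a resize helper along the lines of blk_mq_tag_update_depth() (the call and its signature here are approximations, not taken from this patch):

/*
 * Sketch only: clamp the hw tag depth to q->nr_requests when the
 * scheduler uses hw tags directly.  The resize helper and its signature
 * are assumptions for illustration.
 */
static int example_sched_limit_hw_tags(struct request_queue *q,
				       struct blk_mq_hw_ctx *hctx)
{
	if (!(hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG))
		return 0;

	if (hctx->tags->nr_tags <= q->nr_requests)
		return 0;

	/* assumed helper: shrink the hw tag space to nr_requests */
	return blk_mq_tag_update_depth(hctx, &hctx->tags, q->nr_requests,
				       false);
}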
Ming Lei May 3, 2017, 8:13 p.m. UTC | #3
On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
>> When blk-mq I/O scheduler is used, we need two tags for
>> submitting one request. One is called scheduler tag for
>> allocating request and scheduling I/O, another one is called
>> driver tag, which is used for dispatching IO to hardware/driver.
>> This way introduces one extra per-queue allocation for both tags
>> and request pool, and may not be as efficient as case of none
>> scheduler.
>>
>> Also currently we put a default per-hctx limit on schedulable
>> requests, and this limit may be a bottleneck for some devices,
>> especialy when these devices have a quite big tag space.
>>
>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
>> allow to use hardware/driver tags directly for IO scheduling if
>> devices's hardware tag space is big enough. Then we can avoid
>> the extra resource allocation and make IO submission more
>> efficient.
>>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> ---
>>  block/blk-mq-sched.c   | 10 +++++++++-
>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>>  include/linux/blk-mq.h |  1 +
>>  3 files changed, 39 insertions(+), 7 deletions(-)
>
> One more note on this: if we're using the hardware tags directly, then
> we are no longer limited to q->nr_requests requests in-flight. Instead,
> we're limited to the hw queue depth. We probably want to maintain the
> original behavior,

That needs further investigation; generally a scheduler should be happy
with more requests that can be scheduled.

We can handle it as a follow-up.

> so I think we need to resize the hw tags in blk_mq_init_sched() if we're using hardware tags.

That might not be good, since the hw tags are used by both the scheduler
and dispatching.


Thanks,
Ming Lei
Omar Sandoval May 3, 2017, 9:40 p.m. UTC | #4
On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
> > On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> >> When blk-mq I/O scheduler is used, we need two tags for
> >> submitting one request. One is called scheduler tag for
> >> allocating request and scheduling I/O, another one is called
> >> driver tag, which is used for dispatching IO to hardware/driver.
> >> This way introduces one extra per-queue allocation for both tags
> >> and request pool, and may not be as efficient as case of none
> >> scheduler.
> >>
> >> Also currently we put a default per-hctx limit on schedulable
> >> requests, and this limit may be a bottleneck for some devices,
> >> especialy when these devices have a quite big tag space.
> >>
> >> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> >> allow to use hardware/driver tags directly for IO scheduling if
> >> devices's hardware tag space is big enough. Then we can avoid
> >> the extra resource allocation and make IO submission more
> >> efficient.
> >>
> >> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >> ---
> >>  block/blk-mq-sched.c   | 10 +++++++++-
> >>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> >>  include/linux/blk-mq.h |  1 +
> >>  3 files changed, 39 insertions(+), 7 deletions(-)
> >
> > One more note on this: if we're using the hardware tags directly, then
> > we are no longer limited to q->nr_requests requests in-flight. Instead,
> > we're limited to the hw queue depth. We probably want to maintain the
> > original behavior,
> 
> That need further investigation, and generally scheduler should be happy with
> more requests which can be scheduled.
> 
> We can make it as one follow-up.

If we say nr_requests is 256, then we should honor that. So either
update nr_requests to reflect the actual depth we're using or resize the
hardware tags.

> > so I think we need to resize the hw tags in blk_mq_init_sched() if we're using hardware tags.
> 
> That might not be good since hw tags are used by both scheduler and dispatching.

What do you mean? If we have BLK_MQ_F_SCHED_USE_HW_TAG set, then they
are not used for dispatching, and of course we shouldn't resize the
hardware tags if we are using scheduler tags.
Ming Lei May 4, 2017, 2:01 a.m. UTC | #5
On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
>> > On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
>> >> When blk-mq I/O scheduler is used, we need two tags for
>> >> submitting one request. One is called scheduler tag for
>> >> allocating request and scheduling I/O, another one is called
>> >> driver tag, which is used for dispatching IO to hardware/driver.
>> >> This way introduces one extra per-queue allocation for both tags
>> >> and request pool, and may not be as efficient as case of none
>> >> scheduler.
>> >>
>> >> Also currently we put a default per-hctx limit on schedulable
>> >> requests, and this limit may be a bottleneck for some devices,
>> >> especialy when these devices have a quite big tag space.
>> >>
>> >> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
>> >> allow to use hardware/driver tags directly for IO scheduling if
>> >> devices's hardware tag space is big enough. Then we can avoid
>> >> the extra resource allocation and make IO submission more
>> >> efficient.
>> >>
>> >> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> >> ---
>> >>  block/blk-mq-sched.c   | 10 +++++++++-
>> >>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>> >>  include/linux/blk-mq.h |  1 +
>> >>  3 files changed, 39 insertions(+), 7 deletions(-)
>> >
>> > One more note on this: if we're using the hardware tags directly, then
>> > we are no longer limited to q->nr_requests requests in-flight. Instead,
>> > we're limited to the hw queue depth. We probably want to maintain the
>> > original behavior,
>>
>> That need further investigation, and generally scheduler should be happy with
>> more requests which can be scheduled.
>>
>> We can make it as one follow-up.
>
> If we say nr_requests is 256, then we should honor that. So either
> update nr_requests to reflect the actual depth we're using or resize the
> hardware tags.

Firstly, nr_requests is set to 256 by blk-mq internally rather than from
user space, so it isn't a big deal to go beyond that.

Secondly, when there are enough tags available, it might hurt performance
if we don't use them all.

>
>> > so I think we need to resize the hw tags in blk_mq_init_sched() if we're using hardware tags.
>>
>> That might not be good since hw tags are used by both scheduler and dispatching.
>
> What do you mean? If we have BLK_MQ_F_SCHED_USE_HW_TAG set, then they
> are not used for dispatching, and of course we shouldn't resize the
> hardware tags if we are using scheduler tags.

The tag is actually used for both scheduling and dispatching, and if you
resize the hw tags, the tag space for dispatching is resized (decreased)
too.


Thanks,
Ming Lei
Jens Axboe May 4, 2017, 2:13 a.m. UTC | #6
On 05/03/2017 08:01 PM, Ming Lei wrote:
> On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
>> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
>>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
>>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
>>>>> When blk-mq I/O scheduler is used, we need two tags for
>>>>> submitting one request. One is called scheduler tag for
>>>>> allocating request and scheduling I/O, another one is called
>>>>> driver tag, which is used for dispatching IO to hardware/driver.
>>>>> This way introduces one extra per-queue allocation for both tags
>>>>> and request pool, and may not be as efficient as case of none
>>>>> scheduler.
>>>>>
>>>>> Also currently we put a default per-hctx limit on schedulable
>>>>> requests, and this limit may be a bottleneck for some devices,
>>>>> especialy when these devices have a quite big tag space.
>>>>>
>>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
>>>>> allow to use hardware/driver tags directly for IO scheduling if
>>>>> devices's hardware tag space is big enough. Then we can avoid
>>>>> the extra resource allocation and make IO submission more
>>>>> efficient.
>>>>>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>>  block/blk-mq-sched.c   | 10 +++++++++-
>>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>>>>>  include/linux/blk-mq.h |  1 +
>>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
>>>>
>>>> One more note on this: if we're using the hardware tags directly, then
>>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
>>>> we're limited to the hw queue depth. We probably want to maintain the
>>>> original behavior,
>>>
>>> That need further investigation, and generally scheduler should be happy with
>>> more requests which can be scheduled.
>>>
>>> We can make it as one follow-up.
>>
>> If we say nr_requests is 256, then we should honor that. So either
>> update nr_requests to reflect the actual depth we're using or resize the
>> hardware tags.
> 
> Firstly nr_requests is set as 256 from blk-mq inside instead of user
> space, it won't be a big deal to violate that.

The legacy scheduling layer used 2*128 by default; that's why I used the
"magic" 256 internally. FWIW, I agree with Omar here. If it's set to
256, we must honor that. Users will tweak this value down to trade peak
performance for latency, so it's important that it does what it advertises.

> Secondly, when there is enough tags available, it might hurt
> performance if we don't use them all.

That's mostly bogus. Crazy large tag depths have only one use case -
synthetic peak performance benchmarks from manufacturers. We don't want
to allow really deep queues. Nothing good comes from that, just a lot of
pain and latency issues.

The most important part is actually that the scheduler has a higher
depth than the device, as mentioned in an email from a few days ago. We
need to be able to actually schedule IO to the device, we can't do that
if we always deplete the scheduler queue by letting the device drain it.
Ming Lei May 4, 2017, 2:51 a.m. UTC | #7
On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
> On 05/03/2017 08:01 PM, Ming Lei wrote:
> > On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> >>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> >>>>> When blk-mq I/O scheduler is used, we need two tags for
> >>>>> submitting one request. One is called scheduler tag for
> >>>>> allocating request and scheduling I/O, another one is called
> >>>>> driver tag, which is used for dispatching IO to hardware/driver.
> >>>>> This way introduces one extra per-queue allocation for both tags
> >>>>> and request pool, and may not be as efficient as case of none
> >>>>> scheduler.
> >>>>>
> >>>>> Also currently we put a default per-hctx limit on schedulable
> >>>>> requests, and this limit may be a bottleneck for some devices,
> >>>>> especialy when these devices have a quite big tag space.
> >>>>>
> >>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> >>>>> allow to use hardware/driver tags directly for IO scheduling if
> >>>>> devices's hardware tag space is big enough. Then we can avoid
> >>>>> the extra resource allocation and make IO submission more
> >>>>> efficient.
> >>>>>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>> ---
> >>>>>  block/blk-mq-sched.c   | 10 +++++++++-
> >>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> >>>>>  include/linux/blk-mq.h |  1 +
> >>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
> >>>>
> >>>> One more note on this: if we're using the hardware tags directly, then
> >>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
> >>>> we're limited to the hw queue depth. We probably want to maintain the
> >>>> original behavior,
> >>>
> >>> That need further investigation, and generally scheduler should be happy with
> >>> more requests which can be scheduled.
> >>>
> >>> We can make it as one follow-up.
> >>
> >> If we say nr_requests is 256, then we should honor that. So either
> >> update nr_requests to reflect the actual depth we're using or resize the
> >> hardware tags.
> > 
> > Firstly nr_requests is set as 256 from blk-mq inside instead of user
> > space, it won't be a big deal to violate that.
> 
> The legacy scheduling layer used 2*128 by default, that's why I used the
> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
> 256, we must honor that. Users will tweak this value down to trade peak
> performance for latency, it's important that it does what it advertises.

In the case of scheduling with hw tags, we share tags between the
scheduler and dispatching; if we resize (only decrease, actually) the
tags, the dispatching space (hw tags) is decreased too. That means the
actual usable device tag space would have to be decreased a lot.

> 
> > Secondly, when there is enough tags available, it might hurt
> > performance if we don't use them all.
> 
> That's mostly bogus. Crazy large tag depths have only one use case -
> synthetic peak performance benchmarks from manufacturers. We don't want
> to allow really deep queues. Nothing good comes from that, just a lot of
> pain and latency issues.

Given that the device provides such a high queue depth, it might be
reasonable to just allow all of it to be used. For example, with NVMe,
once an mq scheduler is enabled, the actual size of the device tag space
is just 256 by default, even though the hardware provides a very big tag
space (>= 10K).

The problem is that the lifetime of a sched tag is the same as the
request's lifetime (from submission to completion), and it covers the
lifetime of the device tag. In theory the sched tag could be freed right
after the rq is dispatched to the driver; unfortunately we can't do that,
because the request is allocated from the sched tag set.
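
To make the lifetime issue concrete, the ordering with a separate scheduler tag set looks roughly like this (an editorial comment-style summary, not code from the patch):

/*
 * alloc rq   -> sched tag acquired
 * dispatch   -> driver tag acquired   (sched tag still held, because the
 *                                       rq memory lives in the sched tag set)
 * completion -> driver tag released, then sched tag released
 */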

> 
> The most important part is actually that the scheduler has a higher
> depth than the device, as mentioned in an email from a few days ago. We

I agree with this point, but:

Unfortunately, in the case of NVMe or other high-depth devices, the
default scheduler queue depth (256) is much less than the device depth.
Do we need to adjust the default value for these devices? In theory, the
default scheduler depth of 256 may hurt performance on these devices,
since the device tag space is heavily under-utilized.


Thanks,
Ming
Jens Axboe May 4, 2017, 2:06 p.m. UTC | #8
On 05/03/2017 08:51 PM, Ming Lei wrote:
> On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
>> On 05/03/2017 08:01 PM, Ming Lei wrote:
>>> On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
>>>> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
>>>>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
>>>>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
>>>>>>> When blk-mq I/O scheduler is used, we need two tags for
>>>>>>> submitting one request. One is called scheduler tag for
>>>>>>> allocating request and scheduling I/O, another one is called
>>>>>>> driver tag, which is used for dispatching IO to hardware/driver.
>>>>>>> This way introduces one extra per-queue allocation for both tags
>>>>>>> and request pool, and may not be as efficient as case of none
>>>>>>> scheduler.
>>>>>>>
>>>>>>> Also currently we put a default per-hctx limit on schedulable
>>>>>>> requests, and this limit may be a bottleneck for some devices,
>>>>>>> especialy when these devices have a quite big tag space.
>>>>>>>
>>>>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
>>>>>>> allow to use hardware/driver tags directly for IO scheduling if
>>>>>>> devices's hardware tag space is big enough. Then we can avoid
>>>>>>> the extra resource allocation and make IO submission more
>>>>>>> efficient.
>>>>>>>
>>>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>>>> ---
>>>>>>>  block/blk-mq-sched.c   | 10 +++++++++-
>>>>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
>>>>>>>  include/linux/blk-mq.h |  1 +
>>>>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> One more note on this: if we're using the hardware tags directly, then
>>>>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
>>>>>> we're limited to the hw queue depth. We probably want to maintain the
>>>>>> original behavior,
>>>>>
>>>>> That need further investigation, and generally scheduler should be happy with
>>>>> more requests which can be scheduled.
>>>>>
>>>>> We can make it as one follow-up.
>>>>
>>>> If we say nr_requests is 256, then we should honor that. So either
>>>> update nr_requests to reflect the actual depth we're using or resize the
>>>> hardware tags.
>>>
>>> Firstly nr_requests is set as 256 from blk-mq inside instead of user
>>> space, it won't be a big deal to violate that.
>>
>> The legacy scheduling layer used 2*128 by default, that's why I used the
>> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
>> 256, we must honor that. Users will tweak this value down to trade peak
>> performance for latency, it's important that it does what it advertises.
> 
> In case of scheduling with hw tags, we share tags between scheduler and
> dispatching, if we resize(only decrease actually) the tags, dispatching
> space(hw tags) is decreased too. That means the actual usable device tag
> space need to be decreased much.

I think the solution here is to handle it differently. Previously, we had
requests and tags independent. That meant we could have an independent
set of requests for scheduling, and then assign tags as we needed to
dispatch them to hardware. This is how the old schedulers worked, and
with scheduler tags, this is how the new blk-mq scheduling works as
well.

Once you start treating them as one space again, we run into this issue.
I can think of two solutions:

1) Keep our current split, so we can schedule independently of hardware
   tags.

2) Throttle the queue depth independently. If the user asks for a depth
   of, e.g., 32, retain a larger set of requests but limit the queue depth
   on the device side to 32 (a rough sketch follows below).

This is much easier to support with split hardware and scheduler tags...
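
A rough sketch of option 2, assuming a hypothetical per-hctx counter that caps how many requests may hold a device-side slot at once; the field and helper names below are made up for illustration.

/*
 * Illustration of option 2: keep a large request/tag pool, but bound the
 * number of requests actually outstanding at the device.
 * 'nr_device_inflight' is a hypothetical field, not an existing one.
 */
static bool example_may_dispatch(struct blk_mq_hw_ctx *hctx,
				 unsigned int device_depth)
{
	if (atomic_inc_return(&hctx->nr_device_inflight) > device_depth) {
		atomic_dec(&hctx->nr_device_inflight);
		return false;	/* over the user-requested device depth */
	}
	return true;
}

static void example_dispatch_done(struct blk_mq_hw_ctx *hctx)
{
	atomic_dec(&hctx->nr_device_inflight);
}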

>>> Secondly, when there is enough tags available, it might hurt
>>> performance if we don't use them all.
>>
>> That's mostly bogus. Crazy large tag depths have only one use case -
>> synthetic peak performance benchmarks from manufacturers. We don't want
>> to allow really deep queues. Nothing good comes from that, just a lot of
>> pain and latency issues.
> 
> Given device provides so high queue depth, it might be reasonable to just
> allow to use them up. For example of NVMe, once mq scheduler is enabled,
> the actual size of device tag space is just 256 at default, even though
> the hardware provides very big tag space(>= 10K).

Correct.

> The problem is that lifetime of sched tag is same with request's
> lifetime(from submission to completion), and it covers lifetime of
> device tag.  In theory sched tag should have been freed just after
> the rq is dispatched to driver. Unfortunately we can't do that because
> request is allocated from sched tag set.

Yep

>> The most important part is actually that the scheduler has a higher
>> depth than the device, as mentioned in an email from a few days ago. We
> 
> I agree this point, but:
> 
> Unfortunately in case of NVMe or other high depth devices, the default
> scheduler queue depth(256) is much less than device depth, do we need to
> adjust the default value for this devices? In theory, the default 256
> scheduler depth may hurt performance on this devices since the device
> tag space is much under-utilized.

No, we do not. 256 is a LOT. I realize most of the devices expose 64K *
num_hw_queues of depth. Expecting to utilize all that is insane.
Internally, these devices have nowhere near that amount of parallelism.
Hence we'd go well beyond the latency knee in the curve if we just allow
tons of writeback to queue up, for example. Reaching peak performance on
these devices does not require more than 256 requests; in fact it can be
reached much sooner. For a default setting, I'd actually argue that 256 is
too much, and that we should set it lower.
Ming Lei May 5, 2017, 10:54 p.m. UTC | #9
On Thu, May 04, 2017 at 08:06:15AM -0600, Jens Axboe wrote:
> On 05/03/2017 08:51 PM, Ming Lei wrote:
> > On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
> >> On 05/03/2017 08:01 PM, Ming Lei wrote:
> >>> On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >>>> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> >>>>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@osandov.com> wrote:
> >>>>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> >>>>>>> When blk-mq I/O scheduler is used, we need two tags for
> >>>>>>> submitting one request. One is called scheduler tag for
> >>>>>>> allocating request and scheduling I/O, another one is called
> >>>>>>> driver tag, which is used for dispatching IO to hardware/driver.
> >>>>>>> This way introduces one extra per-queue allocation for both tags
> >>>>>>> and request pool, and may not be as efficient as case of none
> >>>>>>> scheduler.
> >>>>>>>
> >>>>>>> Also currently we put a default per-hctx limit on schedulable
> >>>>>>> requests, and this limit may be a bottleneck for some devices,
> >>>>>>> especialy when these devices have a quite big tag space.
> >>>>>>>
> >>>>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> >>>>>>> allow to use hardware/driver tags directly for IO scheduling if
> >>>>>>> devices's hardware tag space is big enough. Then we can avoid
> >>>>>>> the extra resource allocation and make IO submission more
> >>>>>>> efficient.
> >>>>>>>
> >>>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>>>> ---
> >>>>>>>  block/blk-mq-sched.c   | 10 +++++++++-
> >>>>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> >>>>>>>  include/linux/blk-mq.h |  1 +
> >>>>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
> >>>>>>
> >>>>>> One more note on this: if we're using the hardware tags directly, then
> >>>>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
> >>>>>> we're limited to the hw queue depth. We probably want to maintain the
> >>>>>> original behavior,
> >>>>>
> >>>>> That need further investigation, and generally scheduler should be happy with
> >>>>> more requests which can be scheduled.
> >>>>>
> >>>>> We can make it as one follow-up.
> >>>>
> >>>> If we say nr_requests is 256, then we should honor that. So either
> >>>> update nr_requests to reflect the actual depth we're using or resize the
> >>>> hardware tags.
> >>>
> >>> Firstly nr_requests is set as 256 from blk-mq inside instead of user
> >>> space, it won't be a big deal to violate that.
> >>
> >> The legacy scheduling layer used 2*128 by default, that's why I used the
> >> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
> >> 256, we must honor that. Users will tweak this value down to trade peak
> >> performance for latency, it's important that it does what it advertises.
> > 
> > In case of scheduling with hw tags, we share tags between scheduler and
> > dispatching, if we resize(only decrease actually) the tags, dispatching
> > space(hw tags) is decreased too. That means the actual usable device tag
> > space need to be decreased much.
> 
> I think the solution here is to handle it differently. Previous, we had
> requests and tags independent. That meant that we could have an
> independent set of requests for scheduling, then assign tags as we need
> to dispatch them to hardware. This is how the old schedulers worked, and
> with the scheduler tags, this is how the new blk-mq scheduling works as
> well.
> 
> Once you start treating them as one space again, we run into this issue.
> I can think of two solutions:
> 
> 1) Keep our current split, so we can schedule independently of hardware
>    tags.

Actually the hw tag depends on the scheduler tag, as I explained, so the
two aren't independent, even though they look split.

Also, I am not sure how we would do that if we need to support scheduling
with hw tags; could you explain it a bit?

> 
> 2) Throttle the queue depth independently. If the user asks for a depth
>    of, eg, 32, retain a larger set of requests but limit the queue depth
>    on the device side fo 32.

If I understand correctly, we can support scheduling with hw tags with
this patchset plus setting q->nr_requests to the size of the hw tag
space. Otherwise it isn't easy to throttle the queue depth independently,
because the hw tag actually depends on the scheduler tag.

A third option is to follow Omar's suggestion and simply resize the queue
depth to q->nr_requests.
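
If the first of those routes were taken (scheduling with hw tags plus nr_requests tracking the hw tag space), the adjustment in blk_mq_init_sched() could be as small as the sketch below; the helper name is invented, the fields are existing blk-mq ones.

/*
 * Sketch only: let the scheduler depth follow the full hw tag space when
 * hw tags are used directly for scheduling.
 */
static void example_pick_sched_depth(struct request_queue *q,
				     struct blk_mq_hw_ctx *hctx)
{
	if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
		q->nr_requests = q->tag_set->queue_depth;
}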

> 
> This is much easier to support with split hardware and scheduler tags...
> 
> >>> Secondly, when there is enough tags available, it might hurt
> >>> performance if we don't use them all.
> >>
> >> That's mostly bogus. Crazy large tag depths have only one use case -
> >> synthetic peak performance benchmarks from manufacturers. We don't want
> >> to allow really deep queues. Nothing good comes from that, just a lot of
> >> pain and latency issues.
> > 
> > Given device provides so high queue depth, it might be reasonable to just
> > allow to use them up. For example of NVMe, once mq scheduler is enabled,
> > the actual size of device tag space is just 256 at default, even though
> > the hardware provides very big tag space(>= 10K).
> 
> Correct.
> 
> > The problem is that lifetime of sched tag is same with request's
> > lifetime(from submission to completion), and it covers lifetime of
> > device tag.  In theory sched tag should have been freed just after
> > the rq is dispatched to driver. Unfortunately we can't do that because
> > request is allocated from sched tag set.
> 
> Yep

Request copying, like your first posting of the mq scheduler patches did,
may fix this issue; that way we can make the two tag spaces independent,
but at an extra cost. What do you think about this approach?
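
A very rough sketch of the request-copying idea, purely illustrative (the static_rqs indexing and the plain memcpy are simplifying assumptions, not how a real implementation would have to do it):

/*
 * Illustration only: at dispatch time, copy the scheduler-owned request
 * into the request slot belonging to the freshly acquired driver tag, so
 * the scheduler tag could be released right after dispatch.
 */
static struct request *example_copy_to_driver_rq(struct blk_mq_hw_ctx *hctx,
						 struct request *sched_rq,
						 unsigned int drv_tag)
{
	struct request *drv_rq = hctx->tags->static_rqs[drv_tag];

	memcpy(drv_rq, sched_rq, sizeof(*drv_rq));
	drv_rq->tag = drv_tag;
	drv_rq->internal_tag = -1;
	hctx->tags->rqs[drv_tag] = drv_rq;
	return drv_rq;
}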

> 
> >> The most important part is actually that the scheduler has a higher
> >> depth than the device, as mentioned in an email from a few days ago. We
> > 
> > I agree this point, but:
> > 
> > Unfortunately in case of NVMe or other high depth devices, the default
> > scheduler queue depth(256) is much less than device depth, do we need to
> > adjust the default value for this devices? In theory, the default 256
> > scheduler depth may hurt performance on this devices since the device
> > tag space is much under-utilized.
> 
> No we do not. 256 is a LOT. I realize most of the devices expose 64K *
> num_hw_queues of depth. Expecting to utilize all that is insane.
> Internally, these devices have nowhere near that amount of parallelism.
> Hence we'd go well beyond the latency knee in the curve if we just allow
> tons of writeback to queue up, for example. Reaching peak performance on
> these devices do not require more than 256 requests, in fact it can be
> done much sooner. For a default setting, I'd actually argue that 256 is
> too much, and that we should set it lower.

Then the question is: why does the storage hardware/industry introduce
such a big tag space in modern storage hardware? Tag space definitely
isn't free from the hardware's point of view.

IMO, a big tag space may improve latency a bit under heavy I/O load, and
my simple test has shown the improvement.

Does anyone know the design behind the big tag space/hw queue depth?
Cc'ing the NVMe and SCSI lists, since there may be more storage hw
experts there.

IMO, if this design is just for meeting some very special performance
requirements, the block layer should still allow users to make full use
of the big tag space.


Thanks,
Ming
Ming Lei May 10, 2017, 7:25 a.m. UTC | #10
Hi Jens,

On Thu, May 04, 2017 at 08:06:15AM -0600, Jens Axboe wrote:
...
> 
> No we do not. 256 is a LOT. I realize most of the devices expose 64K *
> num_hw_queues of depth. Expecting to utilize all that is insane.
> Internally, these devices have nowhere near that amount of parallelism.
> Hence we'd go well beyond the latency knee in the curve if we just allow
> tons of writeback to queue up, for example. Reaching peak performance on
> these devices do not require more than 256 requests, in fact it can be
> done much sooner. For a default setting, I'd actually argue that 256 is
> too much, and that we should set it lower.

After studying SSDs and NVMe a bit, I think your point that '256 is a LOT'
is correct:

1) Inside an SSD, the channel count isn't big (often about 10 in high-end
SSDs), so 256 should be enough to maximize the utilization of each
channel, even when multi-bank, multi-die or multi-plane layouts are
considered.

2) For NVMe, the I/O queue depth (size) is at most 64K according to the
spec, and the driver currently limits it to at most 1024; but since the
queue itself is allocated entirely from system memory, a big queue size
doesn't increase the NVMe chip cost.

So I think we can respect .nr_requests by resizing the hw tags in
blk_mq_init_sched(), as suggested by Omar. If you don't object to that,
I will send out V3 soon.

Thanks,
Ming

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 27c67465f856..45a675f07b8b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,7 +83,12 @@  struct request *blk_mq_sched_get_request(struct request_queue *q,
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
 	if (e) {
-		data->flags |= BLK_MQ_REQ_INTERNAL;
+		/*
+		 * If BLK_MQ_F_SCHED_USE_HW_TAG is set, we use hardware
+		 * tag for IO scheduler directly.
+		 */
+		if (!(data->hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG))
+			data->flags |= BLK_MQ_REQ_INTERNAL;
 
 		/*
 		 * Flush requests are special and go directly to the
@@ -431,6 +436,9 @@  static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	struct blk_mq_tag_set *set = q->tag_set;
 	int ret;
 
+	if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
+		return 0;
+
 	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
 					       set->reserved_tags);
 	if (!hctx->sched_tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0168b27469cb..e530bc54f0d9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -263,9 +263,19 @@  struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 				rq->rq_flags = RQF_MQ_INFLIGHT;
 				atomic_inc(&data->hctx->nr_active);
 			}
-			rq->tag = tag;
-			rq->internal_tag = -1;
-			data->hctx->tags->rqs[rq->tag] = rq;
+			data->hctx->tags->rqs[tag] = rq;
+
+			/*
+			 * If we use hw tag for scheduling, postpone setting
+			 * rq->tag in blk_mq_get_driver_tag().
+			 */
+			if (data->hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
+				rq->tag = -1;
+				rq->internal_tag = tag;
+			} else {
+				rq->tag = tag;
+				rq->internal_tag = -1;
+			}
 		}
 
 		if (data->flags & BLK_MQ_REQ_RESERVED)
@@ -368,7 +378,7 @@  void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
+	if (sched_tag != -1 && !(hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG))
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
@@ -872,6 +882,12 @@  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 	if (rq->tag != -1)
 		goto done;
 
+	/* we buffered driver tag in rq->internal_tag */
+	if (data.hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG) {
+		rq->tag = rq->internal_tag;
+		goto done;
+	}
+
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
@@ -893,9 +909,15 @@  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 				    struct request *rq)
 {
-	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+	unsigned tag = rq->tag;
+
 	rq->tag = -1;
 
+	if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
+		return;
+
+	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
@@ -2865,7 +2887,8 @@  bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 		blk_flush_plug_list(plug, false);
 
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
-	if (!blk_qc_t_is_internal(cookie))
+	if (!blk_qc_t_is_internal(cookie) || (hctx->flags &
+			BLK_MQ_F_SCHED_USE_HW_TAG))
 		rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
 	else {
 		rq = blk_mq_tag_to_rq(hctx->sched_tags, blk_qc_t_to_tag(cookie));
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 32bd8eb5ba67..53f24df91a05 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -162,6 +162,7 @@  enum {
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
+	BLK_MQ_F_SCHED_USE_HW_TAG	= 1 << 7,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,