[V7,4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

Message ID 20171012183704.22326-5-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei Oct. 12, 2017, 6:37 p.m. UTC
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queueing a request.

The current blk-mq always dequeues one request first, then calls .queue_rq()
to dispatch the request to the LLD. One obvious issue with this approach is
that I/O merging may suffer: when the per-request-queue depth can't be
respected, .queue_rq() has to return BLK_STS_RESOURCE, the request has to
stay on the hctx->dispatch list, and it never gets a chance to participate
in I/O merging again.

This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops,
so that we can try to reserve budget before dequeuing a request. If no
budget is available for queueing I/O, we don't dequeue the request at all,
which improves I/O merging a lot.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 39 +++++++++++++++++++++++++++++++--------
 block/blk-mq.c         | 35 ++++++++++++++++++++++++++++++++---
 block/blk-mq.h         |  2 +-
 include/linux/blk-mq.h | 10 ++++++++++
 4 files changed, 74 insertions(+), 12 deletions(-)

Comments

Jens Axboe Oct. 12, 2017, 6:46 p.m. UTC | #1
On 10/12/2017 12:37 PM, Ming Lei wrote:
> For SCSI devices, there is often per-request-queue depth, which need
> to be respected before queuing one request.
> 
> The current blk-mq always dequeues one request first, then calls .queue_rq()
> to dispatch the request to lld. One obvious issue of this way is that I/O
> merge may not be good, because when the per-request-queue depth can't be
> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> has to staty in hctx->dispatch list, and never got chance to participate
> into I/O merge.
> 
> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> then we can try to get reserved budget first before dequeuing request.
> Once we can't get budget for queueing I/O, we don't need to dequeue request
> at all, then I/O merge can get improved a lot.

I can't help but think that it would be cleaner to just be able to
reinsert the request into the scheduler properly, if we fail to
dispatch it. Bart hinted at that earlier as well.

With that, you would not need budget functions.

I don't feel _that_ strongly about it, though, and it is something
we can do down the line as well. I'd just hate having to introduce
budget ops only to kill them a little later.

If we stick with the budget, then add a parallel blk_mq_get_budget()
like you have a blk_mq_put_budget(), so we don't have to litter the code
with things like:

> +		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> +			return true;

all over the place. There are also a few places where you don't use
blk_mq_put_budget() but open-code it instead; you should be consistent.
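
For reference, a wrapper pair along the lines Jens is asking for might look
roughly like this (an illustrative sketch, not code from the series; note
that the blk_mq_put_budget() added to blk-mq.c by this patch additionally
takes a got_budget flag):

static inline bool blk_mq_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	/* No callback means the driver imposes no per-queue budget. */
	if (q->mq_ops->get_budget)
		return q->mq_ops->get_budget(hctx);
	return true;
}

static inline void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	if (q->mq_ops->put_budget)
		q->mq_ops->put_budget(hctx);
}

With helpers like these, the call sites reduce to a plain
"if (!blk_mq_get_budget(hctx))" check and a blk_mq_put_budget(hctx) on the
failure paths.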
Ming Lei Oct. 13, 2017, 12:19 a.m. UTC | #2
On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> On 10/12/2017 12:37 PM, Ming Lei wrote:
> > For SCSI devices, there is often per-request-queue depth, which need
> > to be respected before queuing one request.
> > 
> > The current blk-mq always dequeues one request first, then calls .queue_rq()
> > to dispatch the request to lld. One obvious issue of this way is that I/O
> > merge may not be good, because when the per-request-queue depth can't be
> > respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> > has to staty in hctx->dispatch list, and never got chance to participate
> > into I/O merge.
> > 
> > This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> > then we can try to get reserved budget first before dequeuing request.
> > Once we can't get budget for queueing I/O, we don't need to dequeue request
> > at all, then I/O merge can get improved a lot.
> 
> I can't help but think that it would be cleaner to just be able to
> reinsert the request into the scheduler properly, if we fail to
> dispatch it. Bart hinted at that earlier as well.

Actually, when I started to investigate this issue, the first thing I tried
was reinserting, but that approach turned out to be even worse on qla2xxx.

Once a request is dequeued, the chance of an I/O merge drops a lot.
With the none scheduler, merging becomes impossible because we only
try to merge against the last 8 requests. With mq-deadline, when one
request is reinserted, another request may be dequeued at the same time.
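
For context, the "last 8 requests" limit refers to the bounded scan blk-mq
performs over a software queue when looking for a merge candidate. A
simplified sketch of that scan (names and details approximate; the real
code also handles front and discard merges):

static bool attempt_merge_sketch(struct request_queue *q,
				 struct blk_mq_ctx *ctx, struct bio *bio)
{
	struct request *rq;
	int checked = 8;	/* only look at the most recently queued requests */

	list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) {
		if (!checked--)
			break;
		if (!blk_rq_merge_ok(rq, bio))
			continue;
		if (blk_try_merge(rq, bio) == ELEVATOR_BACK_MERGE)
			return bio_attempt_back_merge(q, rq, bio);
	}
	return false;
}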

Not to mention the cost of acquiring/releasing the lock; that path is
just doing useless work and wasting CPU.

> 
> With that, you would not need budget functions.
> 
> I don't feel _that_ strongly about it, though, and it is something
> we can do down the line as well. I'd just hate having to introduce
> budget ops only to kill them a little later.
> 
> If we stick with the budget, then add a parallel blk_mq_get_budget()
> like you have a blk_mq_put_budget(), so we don't have to litter the code
> with things like:
> 
> > +		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> > +			return true;
> 
> all over the place. There are also a few places where you don't use
> blk_mq_put_budget() but open-code it instead, you should be consistent.

OK.
Jens Axboe Oct. 13, 2017, 2:44 p.m. UTC | #3
On 10/12/2017 06:19 PM, Ming Lei wrote:
> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>> For SCSI devices, there is often per-request-queue depth, which need
>>> to be respected before queuing one request.
>>>
>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>> merge may not be good, because when the per-request-queue depth can't be
>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>> has to staty in hctx->dispatch list, and never got chance to participate
>>> into I/O merge.
>>>
>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>> then we can try to get reserved budget first before dequeuing request.
>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>> at all, then I/O merge can get improved a lot.
>>
>> I can't help but think that it would be cleaner to just be able to
>> reinsert the request into the scheduler properly, if we fail to
>> dispatch it. Bart hinted at that earlier as well.
> 
> Actually when I start to investigate the issue, the 1st thing I tried
> is to reinsert, but that way is even worse on qla2xxx.
> 
> Once request is dequeued, the IO merge chance is decreased a lot.
> With none scheduler, it becomes not possible to merge because
> we only try to merge over the last 8 requests. With mq-deadline,
> when one request is reinserted, another request may be dequeued
> at the same time.

I don't care too much about 'none'. If perfect merging is crucial for
getting to the performance level you want on the hardware you are using,
you should not be using 'none'. 'none' will work perfectly fine for NVMe
etc style devices, where we are not dependent on merging to the same
extent that we are on other devices.

mq-deadline reinsertion will be expensive, that's in the nature of that
beast. It's basically the same as a normal request insertion. So for
that, we'd have to be a bit careful not to run into this too much. Even
with a dumb approach, it should only happen 1 out of N times, where N is
the typical point at which the device will return STS_RESOURCE. The
reinsertion vs dequeue should be serialized with your patch to do that,
at least for the single queue mq-deadline setup. In fact, I think your
approach suffers from that same basic race, in that the budget isn't a
hard allocation, it's just a hint. It can change between the time you
check it and when you go and dispatch the I/O, if you don't serialize
that part. So it really should be no different in that regard.

> Not mention the cost of acquiring/releasing lock, that work
> is just doing useless work and wasting CPU.

Sure, my point is that if it doesn't happen too often, it doesn't really
matter. It's not THAT expensive.
Ming Lei Oct. 13, 2017, 4:07 p.m. UTC | #4
On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> On 10/12/2017 06:19 PM, Ming Lei wrote:
> > On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>> For SCSI devices, there is often per-request-queue depth, which need
> >>> to be respected before queuing one request.
> >>>
> >>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>> merge may not be good, because when the per-request-queue depth can't be
> >>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>> has to staty in hctx->dispatch list, and never got chance to participate
> >>> into I/O merge.
> >>>
> >>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>> then we can try to get reserved budget first before dequeuing request.
> >>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>> at all, then I/O merge can get improved a lot.
> >>
> >> I can't help but think that it would be cleaner to just be able to
> >> reinsert the request into the scheduler properly, if we fail to
> >> dispatch it. Bart hinted at that earlier as well.
> > 
> > Actually when I start to investigate the issue, the 1st thing I tried
> > is to reinsert, but that way is even worse on qla2xxx.
> > 
> > Once request is dequeued, the IO merge chance is decreased a lot.
> > With none scheduler, it becomes not possible to merge because
> > we only try to merge over the last 8 requests. With mq-deadline,
> > when one request is reinserted, another request may be dequeued
> > at the same time.
> 
> I don't care too much about 'none'. If perfect merging is crucial for
> getting to the performance level you want on the hardware you are using,
> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> etc style devices, where we are not dependent on merging to the same
> extent that we are on other devices.
> 
> mq-deadline reinsertion will be expensive, that's in the nature of that
> beast. It's basically the same as a normal request inserition.  So for
> that, we'd have to be a bit careful not to run into this too much. Even
> with a dumb approach, it should only happen 1 out of N times, where N is
> the typical point at which the device will return STS_RESOURCE. The
> reinsertion vs dequeue should be serialized with your patch to do that,
> at least for the single queue mq-deadline setup. In fact, I think your
> approach suffers from that same basic race, in that the budget isn't a
> hard allocation, it's just a hint. It can change from the time you check
> it, and when you go and dispatch the IO, if you don't serialize that
> part. So really should be no different in that regard.

In the case of SCSI, .get_budget is implemented as atomic counting, and
it is completely effective at avoiding unnecessary dequeues; please take
a look at patch 6.
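
To make "atomic counting" concrete, a driver-side budget pair might look
roughly like the sketch below. The structure and names are made up for
illustration; the real implementation is the SCSI conversion in patch 6.
The two callbacks would be wired into the driver's blk_mq_ops next to
.queue_rq, which per the patch is then responsible for releasing the
reserved budget.

/* Hypothetical per-queue state; the real counter lives in the SCSI stack. */
struct foo_queue_limit {
	atomic_t	busy;		/* outstanding commands */
	int		queue_depth;	/* per-request-queue depth */
};

static bool foo_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct foo_queue_limit *l = hctx->queue->queuedata;

	if (atomic_inc_return(&l->busy) > l->queue_depth) {
		/* No budget: undo the increment so blk-mq skips the dequeue. */
		atomic_dec(&l->busy);
		return false;
	}
	return true;
}

static void foo_put_budget(struct blk_mq_hw_ctx *hctx)
{
	struct foo_queue_limit *l = hctx->queue->queuedata;

	atomic_dec(&l->busy);
}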

> 
> > Not mention the cost of acquiring/releasing lock, that work
> > is just doing useless work and wasting CPU.
> 
> Sure, my point is that if it doesn't happen too often, it doesn't really
> matter. It's not THAT expensive.

Actually it is in the hot path: for example, lpfc's and qla2xxx's queue
depth is 3, so it is quite easy to trigger STS_RESOURCE.


Thanks,
Ming
Ming Lei Oct. 13, 2017, 4:17 p.m. UTC | #5
On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> On 10/12/2017 06:19 PM, Ming Lei wrote:
> > On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>> For SCSI devices, there is often per-request-queue depth, which need
> >>> to be respected before queuing one request.
> >>>
> >>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>> merge may not be good, because when the per-request-queue depth can't be
> >>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>> has to staty in hctx->dispatch list, and never got chance to participate
> >>> into I/O merge.
> >>>
> >>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>> then we can try to get reserved budget first before dequeuing request.
> >>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>> at all, then I/O merge can get improved a lot.
> >>
> >> I can't help but think that it would be cleaner to just be able to
> >> reinsert the request into the scheduler properly, if we fail to
> >> dispatch it. Bart hinted at that earlier as well.
> > 
> > Actually when I start to investigate the issue, the 1st thing I tried
> > is to reinsert, but that way is even worse on qla2xxx.
> > 
> > Once request is dequeued, the IO merge chance is decreased a lot.
> > With none scheduler, it becomes not possible to merge because
> > we only try to merge over the last 8 requests. With mq-deadline,
> > when one request is reinserted, another request may be dequeued
> > at the same time.
> 
> I don't care too much about 'none'. If perfect merging is crucial for
> getting to the performance level you want on the hardware you are using,
> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> etc style devices, where we are not dependent on merging to the same
> extent that we are on other devices.

We still have some SCSI devices, such as qla2xxx, which are 1:1 multi-queue
devices like NVMe. In my tests, the big lock of mq-deadline has been an
issue for this kind of device, and none actually performs better than
mq-deadline, even though its merging isn't as good.

Thanks,
Ming
Jens Axboe Oct. 13, 2017, 4:19 p.m. UTC | #6
On 10/13/2017 10:07 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>> For SCSI devices, there is often per-request-queue depth, which need
>>>>> to be respected before queuing one request.
>>>>>
>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>>>> merge may not be good, because when the per-request-queue depth can't be
>>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>>>> has to staty in hctx->dispatch list, and never got chance to participate
>>>>> into I/O merge.
>>>>>
>>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>>>> then we can try to get reserved budget first before dequeuing request.
>>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>>>> at all, then I/O merge can get improved a lot.
>>>>
>>>> I can't help but think that it would be cleaner to just be able to
>>>> reinsert the request into the scheduler properly, if we fail to
>>>> dispatch it. Bart hinted at that earlier as well.
>>>
>>> Actually when I start to investigate the issue, the 1st thing I tried
>>> is to reinsert, but that way is even worse on qla2xxx.
>>>
>>> Once request is dequeued, the IO merge chance is decreased a lot.
>>> With none scheduler, it becomes not possible to merge because
>>> we only try to merge over the last 8 requests. With mq-deadline,
>>> when one request is reinserted, another request may be dequeued
>>> at the same time.
>>
>> I don't care too much about 'none'. If perfect merging is crucial for
>> getting to the performance level you want on the hardware you are using,
>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>> etc style devices, where we are not dependent on merging to the same
>> extent that we are on other devices.
>>
>> mq-deadline reinsertion will be expensive, that's in the nature of that
>> beast. It's basically the same as a normal request inserition.  So for
>> that, we'd have to be a bit careful not to run into this too much. Even
>> with a dumb approach, it should only happen 1 out of N times, where N is
>> the typical point at which the device will return STS_RESOURCE. The
>> reinsertion vs dequeue should be serialized with your patch to do that,
>> at least for the single queue mq-deadline setup. In fact, I think your
>> approach suffers from that same basic race, in that the budget isn't a
>> hard allocation, it's just a hint. It can change from the time you check
>> it, and when you go and dispatch the IO, if you don't serialize that
>> part. So really should be no different in that regard.
> 
> In case of SCSI, the .get_buget is done as atomic counting,
> and it is completely effective to avoid unnecessary dequeue, please take
> a look at patch 6.

Looks like you are right; I had initially misread that as just checking
the busy count. But you are actually getting the count at that point,
so it should be solid.

>>> Not mention the cost of acquiring/releasing lock, that work
>>> is just doing useless work and wasting CPU.
>>
>> Sure, my point is that if it doesn't happen too often, it doesn't really
>> matter. It's not THAT expensive.
> 
> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> it is quite easy to trigger STS_RESOURCE.

Ugh, that is low.

OK, I think we should just roll with this and see how far we can go. I'll
apply it for 4.15.
Jens Axboe Oct. 13, 2017, 4:20 p.m. UTC | #7
On 10/13/2017 10:17 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>> For SCSI devices, there is often per-request-queue depth, which need
>>>>> to be respected before queuing one request.
>>>>>
>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>>>> merge may not be good, because when the per-request-queue depth can't be
>>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>>>> has to staty in hctx->dispatch list, and never got chance to participate
>>>>> into I/O merge.
>>>>>
>>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>>>> then we can try to get reserved budget first before dequeuing request.
>>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>>>> at all, then I/O merge can get improved a lot.
>>>>
>>>> I can't help but think that it would be cleaner to just be able to
>>>> reinsert the request into the scheduler properly, if we fail to
>>>> dispatch it. Bart hinted at that earlier as well.
>>>
>>> Actually when I start to investigate the issue, the 1st thing I tried
>>> is to reinsert, but that way is even worse on qla2xxx.
>>>
>>> Once request is dequeued, the IO merge chance is decreased a lot.
>>> With none scheduler, it becomes not possible to merge because
>>> we only try to merge over the last 8 requests. With mq-deadline,
>>> when one request is reinserted, another request may be dequeued
>>> at the same time.
>>
>> I don't care too much about 'none'. If perfect merging is crucial for
>> getting to the performance level you want on the hardware you are using,
>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>> etc style devices, where we are not dependent on merging to the same
>> extent that we are on other devices.
> 
> We still have some SCSI device, such as qla2xxx, which is 1:1 multi-queue
> device, like NVMe, in my test, the big lock of mq-deadline has been
> an issue for this kind of device, and none actually is better than
> mq-deadline, even though its merge isn't good.

Kyber should be able to fill that hole, hopefully.
Ming Lei Oct. 13, 2017, 4:21 p.m. UTC | #8
On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote:
> On 10/13/2017 10:07 AM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> >> On 10/12/2017 06:19 PM, Ming Lei wrote:
> >>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>>>> For SCSI devices, there is often per-request-queue depth, which need
> >>>>> to be respected before queuing one request.
> >>>>>
> >>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>>>> merge may not be good, because when the per-request-queue depth can't be
> >>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>>>> has to staty in hctx->dispatch list, and never got chance to participate
> >>>>> into I/O merge.
> >>>>>
> >>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>>>> then we can try to get reserved budget first before dequeuing request.
> >>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>>>> at all, then I/O merge can get improved a lot.
> >>>>
> >>>> I can't help but think that it would be cleaner to just be able to
> >>>> reinsert the request into the scheduler properly, if we fail to
> >>>> dispatch it. Bart hinted at that earlier as well.
> >>>
> >>> Actually when I start to investigate the issue, the 1st thing I tried
> >>> is to reinsert, but that way is even worse on qla2xxx.
> >>>
> >>> Once request is dequeued, the IO merge chance is decreased a lot.
> >>> With none scheduler, it becomes not possible to merge because
> >>> we only try to merge over the last 8 requests. With mq-deadline,
> >>> when one request is reinserted, another request may be dequeued
> >>> at the same time.
> >>
> >> I don't care too much about 'none'. If perfect merging is crucial for
> >> getting to the performance level you want on the hardware you are using,
> >> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> >> etc style devices, where we are not dependent on merging to the same
> >> extent that we are on other devices.
> >>
> >> mq-deadline reinsertion will be expensive, that's in the nature of that
> >> beast. It's basically the same as a normal request inserition.  So for
> >> that, we'd have to be a bit careful not to run into this too much. Even
> >> with a dumb approach, it should only happen 1 out of N times, where N is
> >> the typical point at which the device will return STS_RESOURCE. The
> >> reinsertion vs dequeue should be serialized with your patch to do that,
> >> at least for the single queue mq-deadline setup. In fact, I think your
> >> approach suffers from that same basic race, in that the budget isn't a
> >> hard allocation, it's just a hint. It can change from the time you check
> >> it, and when you go and dispatch the IO, if you don't serialize that
> >> part. So really should be no different in that regard.
> > 
> > In case of SCSI, the .get_buget is done as atomic counting,
> > and it is completely effective to avoid unnecessary dequeue, please take
> > a look at patch 6.
> 
> Looks like you are right, I had initially misread that as just checking
> the busy count. But you are actually getting the count at that point,
> so it should be solid.
> 
> >>> Not mention the cost of acquiring/releasing lock, that work
> >>> is just doing useless work and wasting CPU.
> >>
> >> Sure, my point is that if it doesn't happen too often, it doesn't really
> >> matter. It's not THAT expensive.
> > 
> > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> > it is quite easy to trigger STS_RESOURCE.
> 
> Ugh, that is low.
> 
> OK, I think we should just roll with this and see how far we can go. I'll
> apply it for 4.15.

OK, I have some updates and will post a new version soon.

Thanks
Ming
Ming Lei Oct. 13, 2017, 4:22 p.m. UTC | #9
On Fri, Oct 13, 2017 at 10:20:01AM -0600, Jens Axboe wrote:
> On 10/13/2017 10:17 AM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
> >> On 10/12/2017 06:19 PM, Ming Lei wrote:
> >>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> >>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
> >>>>> For SCSI devices, there is often per-request-queue depth, which need
> >>>>> to be respected before queuing one request.
> >>>>>
> >>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
> >>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
> >>>>> merge may not be good, because when the per-request-queue depth can't be
> >>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
> >>>>> has to staty in hctx->dispatch list, and never got chance to participate
> >>>>> into I/O merge.
> >>>>>
> >>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
> >>>>> then we can try to get reserved budget first before dequeuing request.
> >>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
> >>>>> at all, then I/O merge can get improved a lot.
> >>>>
> >>>> I can't help but think that it would be cleaner to just be able to
> >>>> reinsert the request into the scheduler properly, if we fail to
> >>>> dispatch it. Bart hinted at that earlier as well.
> >>>
> >>> Actually when I start to investigate the issue, the 1st thing I tried
> >>> is to reinsert, but that way is even worse on qla2xxx.
> >>>
> >>> Once request is dequeued, the IO merge chance is decreased a lot.
> >>> With none scheduler, it becomes not possible to merge because
> >>> we only try to merge over the last 8 requests. With mq-deadline,
> >>> when one request is reinserted, another request may be dequeued
> >>> at the same time.
> >>
> >> I don't care too much about 'none'. If perfect merging is crucial for
> >> getting to the performance level you want on the hardware you are using,
> >> you should not be using 'none'. 'none' will work perfectly fine for NVMe
> >> etc style devices, where we are not dependent on merging to the same
> >> extent that we are on other devices.
> > 
> > We still have some SCSI device, such as qla2xxx, which is 1:1 multi-queue
> > device, like NVMe, in my test, the big lock of mq-deadline has been
> > an issue for this kind of device, and none actually is better than
> > mq-deadline, even though its merge isn't good.
> 
> Kyber should be able to fill that hole, hopefully.

Yeah, Kyber still uses the same I/O merging as none. :-)
Jens Axboe Oct. 13, 2017, 4:28 p.m. UTC | #10
On 10/13/2017 10:22 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 10:20:01AM -0600, Jens Axboe wrote:
>> On 10/13/2017 10:17 AM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>>>> For SCSI devices, there is often per-request-queue depth, which need
>>>>>>> to be respected before queuing one request.
>>>>>>>
>>>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>>>> to dispatch the request to lld. One obvious issue of this way is that I/O
>>>>>>> merge may not be good, because when the per-request-queue depth can't be
>>>>>>> respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
>>>>>>> has to staty in hctx->dispatch list, and never got chance to participate
>>>>>>> into I/O merge.
>>>>>>>
>>>>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
>>>>>>> then we can try to get reserved budget first before dequeuing request.
>>>>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request
>>>>>>> at all, then I/O merge can get improved a lot.
>>>>>>
>>>>>> I can't help but think that it would be cleaner to just be able to
>>>>>> reinsert the request into the scheduler properly, if we fail to
>>>>>> dispatch it. Bart hinted at that earlier as well.
>>>>>
>>>>> Actually when I start to investigate the issue, the 1st thing I tried
>>>>> is to reinsert, but that way is even worse on qla2xxx.
>>>>>
>>>>> Once request is dequeued, the IO merge chance is decreased a lot.
>>>>> With none scheduler, it becomes not possible to merge because
>>>>> we only try to merge over the last 8 requests. With mq-deadline,
>>>>> when one request is reinserted, another request may be dequeued
>>>>> at the same time.
>>>>
>>>> I don't care too much about 'none'. If perfect merging is crucial for
>>>> getting to the performance level you want on the hardware you are using,
>>>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>>>> etc style devices, where we are not dependent on merging to the same
>>>> extent that we are on other devices.
>>>
>>> We still have some SCSI device, such as qla2xxx, which is 1:1 multi-queue
>>> device, like NVMe, in my test, the big lock of mq-deadline has been
>>> an issue for this kind of device, and none actually is better than
>>> mq-deadline, even though its merge isn't good.
>>
>> Kyber should be able to fill that hole, hopefully.
> 
> Yeah, kyber still uses same IO merge with none, :-)

That doesn't mean it can't be changed... 'none' has to remain very low
overhead; any extra smarts or logic should be a scheduler thing.
Bart Van Assche Oct. 13, 2017, 4:31 p.m. UTC | #11
On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,

Sorry, but I doubt whether that is correct. More generally, I don't know of
any modern storage HBA for which the default queue depth is so low.

Bart.
Jens Axboe Oct. 13, 2017, 4:33 p.m. UTC | #12
On 10/13/2017 10:31 AM, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> 
> Sorry but I doubt whether that is correct. More in general, I don't know any modern
> storage HBA for which the default queue depth is so low.

That does seem insanely low. BTW, even a low depth isn't an issue, as long as
we know what it roughly is.
Ming Lei Oct. 13, 2017, 4:45 p.m. UTC | #13
On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> 
> Sorry but I doubt whether that is correct. More in general, I don't know any modern
> storage HBA for which the default queue depth is so low.

You can grep:

[ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"

Even SRP/IB isn't that big either, just 32.
Bart Van Assche Oct. 13, 2017, 5:08 p.m. UTC | #14
On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> > On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> > 
> > Sorry but I doubt whether that is correct. More in general, I don't know any modern
> > storage HBA for which the default queue depth is so low.
> 
> You can grep:
> 
> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"

Such a low queue depth will result in suboptimal performance for adapters
that communicate over a storage network. I think that's a bug and that both
adapters support much higher cmd_per_lun values.

(+James Smart)

James, can you explain to us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
from 30 to 3? Was that perhaps a workaround for a bug in a specific target
implementation?

(+Himanshu Madhani)

Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
the qla2xxx initiator driver to the scsi_host->can_queue value?

> Even SRP/IB isn't big too, just 32.

The default value of cmd_per_lun for ib_srp is 62, but that value can be
overridden by selecting another value in /etc/default/srp_daemon.conf. Note:
a lower value is selected if, after SRP login, it becomes clear that the
target queue depth is lower than the cmd_per_lun value requested by the user.
This is a performance optimization and avoids the SRP target system having
to send back BUSY responses to the initiator.

Bart.
Ming Lei Oct. 13, 2017, 5:29 p.m. UTC | #15
On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> > > On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> > > > Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> > > 
> > > Sorry but I doubt whether that is correct. More in general, I don't know any modern
> > > storage HBA for which the default queue depth is so low.
> > 
> > You can grep:
> > 
> > [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> 
> Such a low queue depth will result in suboptimal performance for adapters
> that communicate over a storage network. I think that's a bug and that both
> adapters support much higher cmd_per_lun values.
> 
> (+James Smart)
> 
> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> implementation?
> 
> (+Himanshu Madhani)
> 
> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> the qla2xxx initiator driver to the scsi_host->can_queue value?

->can_queue is the size of the whole tag space shared by all LUNs, so it
doesn't look reasonable to increase cmd_per_lun to .can_queue.

> 
> > Even SRP/IB isn't big too, just 32.
> 
> The default value for ib_srp for cmd_per_lun is 62 but that value can be
> overridden by selecting another value in /etc/default/srp_daemon.conf. Note:
> a lower value is selected if after SRP login it becomes clear that the target
> queue depth is lower than the cmd_per_lun value requested by the user. This
> is a performance optimization and avoids that the SRP target system has to
> send back BUSY responses to the initiator.

OK, thanks for sharing; I just read it as 32 on Laurence's machine.
Bart Van Assche Oct. 13, 2017, 5:47 p.m. UTC | #16
On Sat, 2017-10-14 at 01:29 +0800, Ming Lei wrote:
> ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't
> reasonable to increase cmd_per_lun to .can_queue.

Sorry but I disagree. Setting cmd_per_lun to a value lower than can_queue
will result in suboptimal performance if there is only a single LUN per SCSI
host. If there are multiple LUNs per SCSI host then the blk-mq core tracks
the number of active LUNs through the blk_mq_tags.active_queues variable.
See also hctx_may_queue(). The comment above that function is as follows:

/*
 * For shared tag users, we track the number of currently active users
 * and attempt to provide a fair share of the tag depth for each of them.
 */
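
In slightly more concrete terms, hctx_may_queue() behaves roughly like the
sketch below (simplified and paraphrased, not the exact kernel code): each
hardware queue actively using a shared tag set gets about 1/active_queues
of the tag depth, with a small minimum.

static bool hctx_may_queue_sketch(struct blk_mq_hw_ctx *hctx,
				  unsigned int tag_depth)
{
	unsigned int users, fair_share;

	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/* Round up, but always allow a few tags per active queue. */
	fair_share = max((tag_depth + users - 1) / users, 4U);
	return atomic_read(&hctx->nr_active) < fair_share;
}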

BTW, the ib_srp initiator driver sets cmd_per_lun to can_queue and is able
to achieve more than one million IOPS even in tests with multiple LUNs per
SCSI host.

Bart.
Hannes Reinecke Oct. 16, 2017, 11:30 a.m. UTC | #17
On 10/13/2017 07:29 PM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
>>>>
>>>> Sorry but I doubt whether that is correct. More in general, I don't know any modern
>>>> storage HBA for which the default queue depth is so low.
>>>
>>> You can grep:
>>>
>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
>>
>> Such a low queue depth will result in suboptimal performance for adapters
>> that communicate over a storage network. I think that's a bug and that both
>> adapters support much higher cmd_per_lun values.
>>
>> (+James Smart)
>>
>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
>> implementation?
>>
>> (+Himanshu Madhani)
>>
>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
>> the qla2xxx initiator driver to the scsi_host->can_queue value?
> 
> ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't
> reasonable to increase cmd_per_lun to .can_queue.
> 
'3' is just a starting point; later on it'll be adjusted via
scsi_change_depth().
Looks like it's not working correctly with blk-mq, though.

Cheers,

Hannes
Bart Van Assche Oct. 16, 2017, 4:06 p.m. UTC | #18
On Mon, 2017-10-16 at 13:30 +0200, Hannes Reinecke wrote:
> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> > (+Himanshu Madhani)
> > 
> > Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> > the qla2xxx initiator driver to the scsi_host->can_queue value?
> 
> '3' is just a starting point; later on it'll be adjusted via scsi_change_depth().
> Looks like it's not working correctly with blk-mq, though.

Hello Hannes,

Isn't the purpose of scsi_change_queue_depth() to change the sdev queue_depth
only? As far as I know there is only one assignment of shost->cmd_per_lun in
the SCSI core and that's the following assignment in scsi_host_alloc():

	shost->cmd_per_lun = sht->cmd_per_lun;

Bart.
Ming Lei Oct. 17, 2017, 1:29 a.m. UTC | #19
On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> On 10/13/2017 07:29 PM, Ming Lei wrote:
> > On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> >>>>
> >>>> Sorry but I doubt whether that is correct. More in general, I don't know any modern
> >>>> storage HBA for which the default queue depth is so low.
> >>>
> >>> You can grep:
> >>>
> >>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>
> >> Such a low queue depth will result in suboptimal performance for adapters
> >> that communicate over a storage network. I think that's a bug and that both
> >> adapters support much higher cmd_per_lun values.
> >>
> >> (+James Smart)
> >>
> >> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> >> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> >> implementation?
> >>
> >> (+Himanshu Madhani)
> >>
> >> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> >> the qla2xxx initiator driver to the scsi_host->can_queue value?
> > 
> > ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't
> > reasonable to increase cmd_per_lun to .can_queue.
> > 
> '3' is just a starting point; later on it'll be adjusted via
> scsi_change_depth().
> Looks like it's not working correctly with blk-mq, though.

By default, in scsi_alloc_sdev(), q->queue_depth is set to
host->cmd_per_lun. You are right that q->queue_depth can be adjusted
later too.

q->queue_depth is respected in scsi_dev_queue_ready().
.cmd_per_lun defines the max number of outstanding commands for each
LUN; I guess it is respected by some hardware limit inside.
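
Roughly, the per-LUN check works like the sketch below (simplified from
scsi_dev_queue_ready(); not the exact kernel code): the outstanding-command
count on the scsi_device is bumped atomically and compared against
queue_depth, which is initialized from host->cmd_per_lun.

static bool sdev_queue_ready_sketch(struct scsi_device *sdev)
{
	unsigned int busy = atomic_inc_return(&sdev->device_busy) - 1;

	if (busy >= sdev->queue_depth) {
		/* Too many commands in flight for this LUN; back off. */
		atomic_dec(&sdev->device_busy);
		return false;
	}
	return true;
}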

For example, I remember that on lpfc q->queue_depth is 30 because the
default 'lpfc_lun_queue_depth' is 30, while its .cmd_per_lun is 3.
Per my observation, this .cmd_per_lun limit still takes effect.
Hannes Reinecke Oct. 17, 2017, 6:38 a.m. UTC | #20
On 10/17/2017 03:29 AM, Ming Lei wrote:
> On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
>> On 10/13/2017 07:29 PM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
>>>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
>>>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
>>>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
>>>>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
>>>>>>
>>>>>> Sorry but I doubt whether that is correct. More in general, I don't know any modern
>>>>>> storage HBA for which the default queue depth is so low.
>>>>>
>>>>> You can grep:
>>>>>
>>>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
>>>>
>>>> Such a low queue depth will result in suboptimal performance for adapters
>>>> that communicate over a storage network. I think that's a bug and that both
>>>> adapters support much higher cmd_per_lun values.
>>>>
>>>> (+James Smart)
>>>>
>>>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
>>>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
>>>> implementation?
>>>>
>>>> (+Himanshu Madhani)
>>>>
>>>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
>>>> the qla2xxx initiator driver to the scsi_host->can_queue value?
>>>
>>> ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't
>>> reasonable to increase cmd_per_lun to .can_queue.
>>>
>> '3' is just a starting point; later on it'll be adjusted via
>> scsi_change_depth().
>> Looks like it's not working correctly with blk-mq, though.
> 
> At default, in scsi_alloc_sdev(), q->queue_depth is set as
> host->cmd_per_lun. You are right, q->queue_depth can be adjusted
> later too.
> 
> q->queue_depth is respected in scsi_dev_queue_ready().
> .cmd_per_lun defines the max outstanding cmds for each lun, I
> guess it is respected by some hardware inside.
> 
No, this is purely a Linux abstraction; it has nothing to do with the hardware.

> For example, I remembered that on lpfc q->queue_depth is 30 because
> the default 'lpfc_lun_queue_depth' is 30. And its .cmd_per_lun is 3.
> Per my observation, this .cmd_per_lun limit is still workable.
> 
Again, these are just some pre-defined values to avoid I/O starvation
when there are several LUNs. _If_ we could guarantee I/O fairness between
several (hundreds!) devices all sharing the same tag space, we wouldn't
need these variables.

Cheers,

Hannes
Ming Lei Oct. 17, 2017, 9:36 a.m. UTC | #21
On Tue, Oct 17, 2017 at 08:38:01AM +0200, Hannes Reinecke wrote:
> On 10/17/2017 03:29 AM, Ming Lei wrote:
> > On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote:
> >> On 10/13/2017 07:29 PM, Ming Lei wrote:
> >>> On Fri, Oct 13, 2017 at 05:08:52PM +0000, Bart Van Assche wrote:
> >>>> On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote:
> >>>>> On Fri, Oct 13, 2017 at 04:31:04PM +0000, Bart Van Assche wrote:
> >>>>>> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote:
> >>>>>>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
> >>>>>>
> >>>>>> Sorry but I doubt whether that is correct. More in general, I don't know any modern
> >>>>>> storage HBA for which the default queue depth is so low.
> >>>>>
> >>>>> You can grep:
> >>>>>
> >>>>> [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E "qla2xxx|lpfc"
> >>>>
> >>>> Such a low queue depth will result in suboptimal performance for adapters
> >>>> that communicate over a storage network. I think that's a bug and that both
> >>>> adapters support much higher cmd_per_lun values.
> >>>>
> >>>> (+James Smart)
> >>>>
> >>>> James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN
> >>>> from 30 to 3? Was that perhaps a workaround for a bug in a specific target
> >>>> implementation?
> >>>>
> >>>> (+Himanshu Madhani)
> >>>>
> >>>> Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for
> >>>> the qla2xxx initiator driver to the scsi_host->can_queue value?
> >>>
> >>> ->can_queue is size of the whole tag space shared by all LUNs, looks it isn't
> >>> reasonable to increase cmd_per_lun to .can_queue.
> >>>
> >> '3' is just a starting point; later on it'll be adjusted via
> >> scsi_change_depth().
> >> Looks like it's not working correctly with blk-mq, though.
> > 
> > At default, in scsi_alloc_sdev(), q->queue_depth is set as
> > host->cmd_per_lun. You are right, q->queue_depth can be adjusted
> > later too.
> > 
> > q->queue_depth is respected in scsi_dev_queue_ready().
> > .cmd_per_lun defines the max outstanding cmds for each lun, I
> > guess it is respected by some hardware inside.
> > 
> No, this is purely a linux abstraction. Nothing to do with the hardware.

That was also my initial understanding.

But my test showed that the max number of outstanding commands per LUN
is actually 3 even though q->queue_depth is 30, which is why I guess the
hardware may put a hard limit inside:

	https://marc.info/?l=linux-block&m=150549401611868&w=2

Also, if they were the same thing, why would lpfc define different
default values for q->queue_depth and .cmd_per_lun?

drivers/scsi/lpfc/lpfc_attr.c: 3411
	/*
	# lun_queue_depth:  This parameter is used to limit the number of outstanding
	# commands per FCP LUN. Value range is [1,512]. Default value is 30.
	# If this parameter value is greater than 1/8th the maximum number of exchanges
	# supported by the HBA port, then the lun queue depth will be reduced to
	# 1/8th the maximum number of exchanges.
	*/
	LPFC_VPORT_ATTR_R(lun_queue_depth, 30, 1, 512,
	                  "Max number of FCP commands we can queue to a specific LUN");

drivers/scsi/lpfc/lpfc.h: 47
	#define LPFC_CMD_PER_LUN        3       /* max outstanding cmds per lun */
Bart Van Assche Oct. 17, 2017, 6:09 p.m. UTC | #22
On 10/16/17 23:38, Hannes Reinecke wrote:
> Again, these are just some pre-defined values to avoid I/O starvation
> when having several LUNs. _if_ we can guarantee I/O fairness between
> several (hundreds!) devices all sharing the same tagspace we wouldn't
> need these variables.

I think we already have two mechanisms for guaranteeing I/O fairness:
* Scsi_Host.starved_list for both the legacy SCSI core and scsi-mq.
* For scsi-mq, the active_queues counter in the blk-mq core. See also
   hctx_may_queue().

Bart.

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index be29ba849408..bcf4773ee1ca 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,19 +89,32 @@  static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
 
 	do {
-		struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+		struct request *rq;
 
-		if (!rq)
+		if (e->type->ops.mq.has_work &&
+				!e->type->ops.mq.has_work(hctx))
 			break;
+
+		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+			return true;
+
+		rq = e->type->ops.mq.dispatch_request(hctx);
+		if (!rq) {
+			if (q->mq_ops->put_budget)
+				q->mq_ops->put_budget(hctx);
+			break;
+		}
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+	return false;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -110,6 +123,7 @@  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	LIST_HEAD(rq_list);
+	bool run_queue = false;
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -143,13 +157,22 @@  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
-			blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list, false) &&
+				has_sched_dispatch)
+			run_queue = blk_mq_do_dispatch_sched(hctx);
 	} else if (has_sched_dispatch) {
-		blk_mq_do_dispatch_sched(hctx);
+		run_queue = blk_mq_do_dispatch_sched(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list, false);
+	}
+
+	if (run_queue) {
+		if (!blk_mq_sched_needs_restart(hctx) &&
+		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) {
+			blk_mq_sched_mark_restart_hctx(hctx);
+			blk_mq_run_hw_queue(hctx, true);
+		}
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..e6e50a91f170 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1045,7 +1045,16 @@  static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
+static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (q->mq_ops->put_budget && got_budget)
+		q->mq_ops->put_budget(hctx);
+}
+
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
+		bool got_budget)
 {
 	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
@@ -1054,6 +1063,8 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 	if (list_empty(list))
 		return false;
 
+	WARN_ON(!list_is_singular(list) && got_budget);
+
 	/*
 	 * Now process all the entries, sending them to the driver.
 	 */
@@ -1071,15 +1082,25 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (!blk_mq_dispatch_wait_add(hctx))
+			if (!blk_mq_dispatch_wait_add(hctx)) {
+				blk_mq_put_budget(hctx, got_budget);
 				break;
+			}
 
 			/*
 			 * It's possible that a tag was freed in the window
 			 * between the allocation failure and adding the
 			 * hardware queue to the wait queue.
 			 */
-			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+				blk_mq_put_budget(hctx, got_budget);
+				break;
+			}
+		}
+
+		if (!got_budget) {
+			if (q->mq_ops->get_budget &&
+					!q->mq_ops->get_budget(hctx))
 				break;
 		}
 
@@ -1577,6 +1598,11 @@  static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
+	if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) {
+		blk_mq_put_driver_tag(rq);
+		goto insert;
+	}
+
 	new_cookie = request_to_qc_t(hctx, rq);
 
 	/*
@@ -2577,6 +2603,9 @@  int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->ops->queue_rq)
 		return -EINVAL;
 
+	if ((!!set->ops->get_budget) != (!!set->ops->put_budget))
+		return -EINVAL;
+
 	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
 		pr_info("blk-mq: reduced tag depth to %u\n",
 			BLK_MQ_MAX_DEPTH);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..d1bfce3e69ad 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@  void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..f8c4ba2682ec 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -90,6 +90,8 @@  struct blk_mq_queue_data {
 
 typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
 		const struct blk_mq_queue_data *);
+typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
+typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
 typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
 typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -112,6 +114,14 @@  struct blk_mq_ops {
 	queue_rq_fn		*queue_rq;
 
 	/*
+	 * Reserve budget before queue request, once .queue_rq is
+	 * run, it is driver's responsibility to release the
+	 * reserved budget.
+	 */
+	get_budget_fn		*get_budget;
+	put_budget_fn		*put_budget;
+
+	/*
 	 * Called on request timeout
 	 */
 	timeout_fn		*timeout;