[v2] block: Do not pull requests from the scheduler when we cannot dispatch them

Message ID 20210603104721.6309-1-jack@suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara June 3, 2021, 10:47 a.m. UTC
Provided the device driver does not implement dispatch budget accounting
(which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
requests from the IO scheduler as long as it is willing to give out any.
That defeats scheduling heuristics inside the scheduler by creating a
false impression that the device can take more IO when in fact it
cannot.

For example, with the BFQ IO scheduler on top of a virtio-blk device,
setting the blkio cgroup weight has barely any impact on the observed
throughput of async IO because __blk_mq_do_dispatch_sched() always drains
all the IO queued in BFQ. BFQ first submits IO from higher weight cgroups,
but when that is all dispatched, it gives out IO of lower weight cgroups
as well. Then we have to wait for all this IO to be dispatched to
the disk (which means a lot of it actually has to complete) before the
IO scheduler is queried again for dispatching more requests. This
completely destroys any service differentiation.

So grab the request tag for a request pulled out of the IO scheduler
already in __blk_mq_do_dispatch_sched() and do not pull any more requests
if we cannot get it, because we are unlikely to be able to dispatch them.
That way only a single request is going to wait in the dispatch list for
some tag to free up.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-mq-sched.c | 12 +++++++++++-
 block/blk-mq.c       |  2 +-
 block/blk-mq.h       |  2 ++
 3 files changed, 14 insertions(+), 2 deletions(-)

Jens, can you please merge the patch? Thanks!
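
The effect described in the commit message can be sketched with a small
user-space model. This is only an illustration under stated assumptions, not
kernel code: struct toy_queue, toy_dispatch_old() and toy_dispatch_new() are
hypothetical names, the scheduler is reduced to a counter of queued requests,
and the driver tags are reduced to a counter of free tags.

```c
#include <assert.h>

/* Hypothetical toy model of one hardware queue, not a kernel structure. */
struct toy_queue {
	int sched_queued;	/* requests still held by the IO scheduler */
	int free_tags;		/* driver tags currently available */
};

/*
 * Behaviour before the patch (without budget accounting): keep dequeueing
 * until the scheduler is empty or max_dispatch requests were taken.
 */
int toy_dispatch_old(struct toy_queue *q, int max_dispatch)
{
	int count = 0;

	assert(q && max_dispatch > 0);
	while (q->sched_queued > 0 && count < max_dispatch) {
		q->sched_queued--;	/* request leaves the scheduler */
		count++;
	}
	return count;
}

/*
 * Behaviour after the patch: try to get a tag for every dequeued request
 * and stop dequeueing after the first request that cannot get one.
 */
int toy_dispatch_new(struct toy_queue *q, int max_dispatch)
{
	int count = 0;

	assert(q && max_dispatch > 0);
	while (q->sched_queued > 0 && count < max_dispatch) {
		q->sched_queued--;	/* request leaves the scheduler */
		count++;
		if (q->free_tags == 0)
			break;		/* this one request waits for a tag */
		q->free_tags--;
	}
	return count;
}
```

With 10 queued requests, 2 free tags and max_dispatch = 8, the old loop takes
8 requests out of the scheduler, while the new loop takes 3: two that hold a
tag and one that waits for a tag. The remaining 7 requests stay inside the
scheduler, where its heuristics can still reorder them.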

Comments

Jens Axboe June 3, 2021, 6:01 p.m. UTC | #1
On 6/3/21 4:47 AM, Jan Kara wrote:
> Provided the device driver does not implement dispatch budget accounting
> (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
> requests from the IO scheduler as long as it is willing to give out any.
> That defeats scheduling heuristics inside the scheduler by creating
> false impression that the device can take more IO when it in fact
> cannot.
> 
> For example with BFQ IO scheduler on top of virtio-blk device setting
> blkio cgroup weight has barely any impact on observed throughput of
> async IO because __blk_mq_do_dispatch_sched() always sucks out all the
> IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
> when that is all dispatched, it will give out IO of lower weight cgroups
> as well. And then we have to wait for all this IO to be dispatched to
> the disk (which means lot of it actually has to complete) before the
> IO scheduler is queried again for dispatching more requests. This
> completely destroys any service differentiation.
> 
> So grab request tag for a request pulled out of the IO scheduler already
> in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
> cannot get it because we are unlikely to be able to dispatch it. That
> way only single request is going to wait in the dispatch list for some
> tag to free.

Applied to for-5.14/block, thanks.
Hannes Reinecke June 7, 2021, 10:05 a.m. UTC | #2
On 6/3/21 12:47 PM, Jan Kara wrote:
> Provided the device driver does not implement dispatch budget accounting
> (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
> requests from the IO scheduler as long as it is willing to give out any.
> That defeats scheduling heuristics inside the scheduler by creating
> false impression that the device can take more IO when it in fact
> cannot.
> 
> For example with BFQ IO scheduler on top of virtio-blk device setting
> blkio cgroup weight has barely any impact on observed throughput of
> async IO because __blk_mq_do_dispatch_sched() always sucks out all the
> IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
> when that is all dispatched, it will give out IO of lower weight cgroups
> as well. And then we have to wait for all this IO to be dispatched to
> the disk (which means lot of it actually has to complete) before the
> IO scheduler is queried again for dispatching more requests. This
> completely destroys any service differentiation.
> 
> So grab request tag for a request pulled out of the IO scheduler already
> in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
> cannot get it because we are unlikely to be able to dispatch it. That
> way only single request is going to wait in the dispatch list for some
> tag to free.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-mq-sched.c | 12 +++++++++++-
>  block/blk-mq.c       |  2 +-
>  block/blk-mq.h       |  2 ++
>  3 files changed, 14 insertions(+), 2 deletions(-)
> 
> Jens, can you please merge the patch? Thanks!
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 996a4b2f73aa..714e678f516a 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -168,9 +168,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		 * in blk_mq_dispatch_rq_list().
>  		 */
>  		list_add_tail(&rq->queuelist, &rq_list);
> +		count++;
>  		if (rq->mq_hctx != hctx)
>  			multi_hctxs = true;
> -	} while (++count < max_dispatch);
> +
> +		/*
> +		 * If we cannot get tag for the request, stop dequeueing
> +		 * requests from the IO scheduler. We are unlikely to be able
> +		 * to submit them anyway and it creates false impression for
> +		 * scheduling heuristics that the device can take more IO.
> +		 */
> +		if (!blk_mq_get_driver_tag(rq))
> +			break;
> +	} while (count < max_dispatch);
>  
>  	if (!count) {
>  		if (run_queue)

Doesn't this lead to a double accounting of the allocated tags?
From what I can see we don't really check if the tag is already
allocated in blk_mq_get_driver_tag() ...

Hmm?

Cheers,

Hannes
Jan Kara June 7, 2021, 11:41 a.m. UTC | #3
On Mon 07-06-21 12:05:52, Hannes Reinecke wrote:
> On 6/3/21 12:47 PM, Jan Kara wrote:
> > Provided the device driver does not implement dispatch budget accounting
> > (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
> > requests from the IO scheduler as long as it is willing to give out any.
> > That defeats scheduling heuristics inside the scheduler by creating
> > false impression that the device can take more IO when it in fact
> > cannot.
> > 
> > For example with BFQ IO scheduler on top of virtio-blk device setting
> > blkio cgroup weight has barely any impact on observed throughput of
> > async IO because __blk_mq_do_dispatch_sched() always sucks out all the
> > IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
> > when that is all dispatched, it will give out IO of lower weight cgroups
> > as well. And then we have to wait for all this IO to be dispatched to
> > the disk (which means lot of it actually has to complete) before the
> > IO scheduler is queried again for dispatching more requests. This
> > completely destroys any service differentiation.
> > 
> > So grab request tag for a request pulled out of the IO scheduler already
> > in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
> > cannot get it because we are unlikely to be able to dispatch it. That
> > way only single request is going to wait in the dispatch list for some
> > tag to free.
> > 
> > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  block/blk-mq-sched.c | 12 +++++++++++-
> >  block/blk-mq.c       |  2 +-
> >  block/blk-mq.h       |  2 ++
> >  3 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > Jens, can you please merge the patch? Thanks!
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 996a4b2f73aa..714e678f516a 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -168,9 +168,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  		 * in blk_mq_dispatch_rq_list().
> >  		 */
> >  		list_add_tail(&rq->queuelist, &rq_list);
> > +		count++;
> >  		if (rq->mq_hctx != hctx)
> >  			multi_hctxs = true;
> > -	} while (++count < max_dispatch);
> > +
> > +		/*
> > +		 * If we cannot get tag for the request, stop dequeueing
> > +		 * requests from the IO scheduler. We are unlikely to be able
> > +		 * to submit them anyway and it creates false impression for
> > +		 * scheduling heuristics that the device can take more IO.
> > +		 */
> > +		if (!blk_mq_get_driver_tag(rq))
> > +			break;
> > +	} while (count < max_dispatch);
> >  
> >  	if (!count) {
> >  		if (run_queue)
> 
> Doesn't this lead to a double accounting of the allocated tags?
> From what I can see we don't really check if the tag is already
> allocated in blk_mq_get_driver_tag() ...

I think we do check. blk_mq_get_driver_tag() has:

        if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_get_driver_tag(rq))
                return false;

        if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
                        !(rq->rq_flags & RQF_MQ_INFLIGHT)) {
                rq->rq_flags |= RQF_MQ_INFLIGHT;
                __blk_mq_inc_active_requests(hctx);
        }
        hctx->tags->rqs[rq->tag] = rq;
 
So once we call it, rq->tag will be != BLK_MQ_NO_TAG and RQF_MQ_INFLIGHT
will be set. So neither __blk_mq_get_driver_tag() nor
__blk_mq_inc_active_requests() will get repeated.
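
The two guards quoted above can be modelled in user space as follows. This is
a minimal sketch with hypothetical names (toy_tags, toy_rq,
toy_get_driver_tag(), TOY_NO_TAG); the tag allocator is reduced to a simple
counter, and the inflight accounting is done unconditionally, whereas the
kernel only does it when BLK_MQ_F_TAG_QUEUE_SHARED is set.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_TAG (-1)		/* stands in for BLK_MQ_NO_TAG */

/* Hypothetical toy tag set, not the kernel's struct blk_mq_tags. */
struct toy_tags {
	int next_free;		/* next tag number to hand out */
	int nr_tags;		/* total number of tags */
	int active;		/* models the active request counter */
};

/* Hypothetical toy request, not the kernel's struct request. */
struct toy_rq {
	int tag;		/* TOY_NO_TAG until a tag is assigned */
	bool inflight;		/* models RQF_MQ_INFLIGHT */
};

/*
 * Allocate a tag only if the request has none yet, and bump the active
 * counter only if the request is not yet marked inflight. Calling this
 * twice on the same request therefore changes nothing the second time.
 */
bool toy_get_driver_tag(struct toy_tags *t, struct toy_rq *rq)
{
	assert(t && rq);

	if (rq->tag == TOY_NO_TAG) {
		if (t->next_free >= t->nr_tags)
			return false;	/* no tag available */
		rq->tag = t->next_free++;
	}
	if (!rq->inflight) {
		rq->inflight = true;
		t->active++;
	}
	return true;
}
```

Calling toy_get_driver_tag() twice on the same request with a one-tag set
succeeds both times and leaves next_free and active at 1, which mirrors the
argument that the second call in the dispatch path does not account the tag
twice.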

								Honza
Jan Kara June 16, 2021, 3:40 p.m. UTC | #4
On Thu 03-06-21 12:47:21, Jan Kara wrote:
> Provided the device driver does not implement dispatch budget accounting
> (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
> requests from the IO scheduler as long as it is willing to give out any.
> That defeats scheduling heuristics inside the scheduler by creating
> false impression that the device can take more IO when it in fact
> cannot.
> 
> For example with BFQ IO scheduler on top of virtio-blk device setting
> blkio cgroup weight has barely any impact on observed throughput of
> async IO because __blk_mq_do_dispatch_sched() always sucks out all the
> IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
> when that is all dispatched, it will give out IO of lower weight cgroups
> as well. And then we have to wait for all this IO to be dispatched to
> the disk (which means lot of it actually has to complete) before the
> IO scheduler is queried again for dispatching more requests. This
> completely destroys any service differentiation.
> 
> So grab request tag for a request pulled out of the IO scheduler already
> in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
> cannot get it because we are unlikely to be able to dispatch it. That
> way only single request is going to wait in the dispatch list for some
> tag to free.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-mq-sched.c | 12 +++++++++++-
>  block/blk-mq.c       |  2 +-
>  block/blk-mq.h       |  2 ++
>  3 files changed, 14 insertions(+), 2 deletions(-)
> 
> Jens, can you please merge the patch? Thanks!

Ping Jens?

								Honza

> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 996a4b2f73aa..714e678f516a 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -168,9 +168,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		 * in blk_mq_dispatch_rq_list().
>  		 */
>  		list_add_tail(&rq->queuelist, &rq_list);
> +		count++;
>  		if (rq->mq_hctx != hctx)
>  			multi_hctxs = true;
> -	} while (++count < max_dispatch);
> +
> +		/*
> +		 * If we cannot get tag for the request, stop dequeueing
> +		 * requests from the IO scheduler. We are unlikely to be able
> +		 * to submit them anyway and it creates false impression for
> +		 * scheduling heuristics that the device can take more IO.
> +		 */
> +		if (!blk_mq_get_driver_tag(rq))
> +			break;
> +	} while (count < max_dispatch);
>  
>  	if (!count) {
>  		if (run_queue)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c86c01bfecdb..bc2cf80d2c3b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1100,7 +1100,7 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
>  	return true;
>  }
>  
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +bool blk_mq_get_driver_tag(struct request *rq)
>  {
>  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9ce64bc4a6c8..81a775171be7 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -259,6 +259,8 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
>  	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
>  }
>  
> +bool blk_mq_get_driver_tag(struct request *rq);
> +
>  static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>  {
>  	int cpu;
> -- 
> 2.26.2
>
Jens Axboe June 16, 2021, 3:43 p.m. UTC | #5
On 6/16/21 9:40 AM, Jan Kara wrote:
> On Thu 03-06-21 12:47:21, Jan Kara wrote:
>> Provided the device driver does not implement dispatch budget accounting
>> (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
>> requests from the IO scheduler as long as it is willing to give out any.
>> That defeats scheduling heuristics inside the scheduler by creating
>> false impression that the device can take more IO when it in fact
>> cannot.
>>
>> For example with BFQ IO scheduler on top of virtio-blk device setting
>> blkio cgroup weight has barely any impact on observed throughput of
>> async IO because __blk_mq_do_dispatch_sched() always sucks out all the
>> IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
>> when that is all dispatched, it will give out IO of lower weight cgroups
>> as well. And then we have to wait for all this IO to be dispatched to
>> the disk (which means lot of it actually has to complete) before the
>> IO scheduler is queried again for dispatching more requests. This
>> completely destroys any service differentiation.
>>
>> So grab request tag for a request pulled out of the IO scheduler already
>> in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
>> cannot get it because we are unlikely to be able to dispatch it. That
>> way only single request is going to wait in the dispatch list for some
>> tag to free.
>>
>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>  block/blk-mq-sched.c | 12 +++++++++++-
>>  block/blk-mq.c       |  2 +-
>>  block/blk-mq.h       |  2 ++
>>  3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> Jens, can you please merge the patch? Thanks!
> 
> Ping Jens?

Didn't I reply? It's been merged for a while:

https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.14/block&id=613471549f366cdf4170b81ce0f99f3867ec4d16
Jan Kara June 16, 2021, 3:51 p.m. UTC | #6
On Wed 16-06-21 09:43:06, Jens Axboe wrote:
> On 6/16/21 9:40 AM, Jan Kara wrote:
> > On Thu 03-06-21 12:47:21, Jan Kara wrote:
> >> Provided the device driver does not implement dispatch budget accounting
> >> (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
> >> requests from the IO scheduler as long as it is willing to give out any.
> >> That defeats scheduling heuristics inside the scheduler by creating
> >> false impression that the device can take more IO when it in fact
> >> cannot.
> >>
> >> For example with BFQ IO scheduler on top of virtio-blk device setting
> >> blkio cgroup weight has barely any impact on observed throughput of
> >> async IO because __blk_mq_do_dispatch_sched() always sucks out all the
> >> IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
> >> when that is all dispatched, it will give out IO of lower weight cgroups
> >> as well. And then we have to wait for all this IO to be dispatched to
> >> the disk (which means lot of it actually has to complete) before the
> >> IO scheduler is queried again for dispatching more requests. This
> >> completely destroys any service differentiation.
> >>
> >> So grab request tag for a request pulled out of the IO scheduler already
> >> in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
> >> cannot get it because we are unlikely to be able to dispatch it. That
> >> way only single request is going to wait in the dispatch list for some
> >> tag to free.
> >>
> >> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> >> Signed-off-by: Jan Kara <jack@suse.cz>
> >> ---
> >>  block/blk-mq-sched.c | 12 +++++++++++-
> >>  block/blk-mq.c       |  2 +-
> >>  block/blk-mq.h       |  2 ++
> >>  3 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> Jens, can you please merge the patch? Thanks!
> > 
> > Ping Jens?
> 
> Didn't I reply? It's been merged for a while:

No, I didn't get an email back and didn't notice the patch is already in
your tree. Sorry for the noise.

								Honza

> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.14/block&id=613471549f366cdf4170b81ce0f99f3867ec4d16
> 
> -- 
> Jens Axboe
>
Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 996a4b2f73aa..714e678f516a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -168,9 +168,19 @@  static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 * in blk_mq_dispatch_rq_list().
 		 */
 		list_add_tail(&rq->queuelist, &rq_list);
+		count++;
 		if (rq->mq_hctx != hctx)
 			multi_hctxs = true;
-	} while (++count < max_dispatch);
+
+		/*
+		 * If we cannot get tag for the request, stop dequeueing
+		 * requests from the IO scheduler. We are unlikely to be able
+		 * to submit them anyway and it creates false impression for
+		 * scheduling heuristics that the device can take more IO.
+		 */
+		if (!blk_mq_get_driver_tag(rq))
+			break;
+	} while (count < max_dispatch);
 
 	if (!count) {
 		if (run_queue)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c86c01bfecdb..bc2cf80d2c3b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1100,7 +1100,7 @@  static bool __blk_mq_get_driver_tag(struct request *rq)
 	return true;
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9ce64bc4a6c8..81a775171be7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -259,6 +259,8 @@  static inline void blk_mq_put_driver_tag(struct request *rq)
 	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
 }
 
+bool blk_mq_get_driver_tag(struct request *rq);
+
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
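
The blk-mq-sched.c hunk also moves the increment of count from the loop
condition into the loop body. One plausible reading, sketched below in user
space, is that a break would otherwise skip the increment, so the request that
failed to get a tag would sit on the list without being counted, and the diff
context shows an `if (!count)` check right after the loop. The function names
and the `listed` out-parameter are hypothetical, and the scheduler is assumed
to always have another request available.

```c
#include <assert.h>

/*
 * Variant with the increment left in the loop condition: leaving the loop
 * via break does not count the request that was just put on the list.
 */
int toy_count_in_condition(int free_tags, int max_dispatch, int *listed)
{
	int count = 0;

	assert(free_tags >= 0 && max_dispatch > 0 && listed);
	*listed = 0;
	do {
		(*listed)++;		/* models list_add_tail() */
		if (free_tags == 0)
			break;		/* skips the ++count below */
		free_tags--;
	} while (++count < max_dispatch);
	return count;
}

/*
 * Variant matching the patch: the request is counted before the tag
 * check, so count always equals the number of listed requests.
 */
int toy_count_in_body(int free_tags, int max_dispatch, int *listed)
{
	int count = 0;

	assert(free_tags >= 0 && max_dispatch > 0 && listed);
	*listed = 0;
	do {
		(*listed)++;		/* models list_add_tail() */
		count++;
		if (free_tags == 0)
			break;
		free_tags--;
	} while (count < max_dispatch);
	return count;
}
```

With no free tags, the first variant returns 0 although one request is on the
list, while the second returns 1. With 2 free tags and max_dispatch = 4, the
first returns 2 and the second returns 3, and both put 3 requests on the list.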