Message ID | 1533009735-2221-1-git-send-email-jianchao.w.wang@oracle.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] blk-mq: clean up the hctx restart
On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
> Currently, we will always set SCHED_RESTART whenever there are
> requests in hctx->dispatch, then when request is completed and
> freed the hctx queues will be restarted to avoid IO hang. This
> is unnecessary most of time. Especially when there are lots of
> LUNs attached to one host, the RR restart loop could be very
> expensive.

The big RR restart loop has been killed in the following commit:

commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
Author: Ming Lei <ming.lei@redhat.com>
Date:   Mon Jun 25 19:31:48 2018 +0800

    blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()

Thanks,
Ming
Hi Ming

On 07/31/2018 12:58 PM, Ming Lei wrote:
> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>> Currently, we will always set SCHED_RESTART whenever there are
>> requests in hctx->dispatch, then when request is completed and
>> freed the hctx queues will be restarted to avoid IO hang. This
>> is unnecessary most of time. Especially when there are lots of
>> LUNs attached to one host, the RR restart loop could be very
>> expensive.
>
> The big RR restart loop has been killed in the following commit:
>
> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
> Author: Ming Lei <ming.lei@redhat.com>
> Date:   Mon Jun 25 19:31:48 2018 +0800
>
>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()

Oh, sorry, I didn't look into this patch due to its title when I iterated
over the mail list, so I didn't realize the RR restart loop had already
been killed. :)

The RR restart loop could also ensure fairness when sharing some LLDD
resource, not just avoid IO hangs. Is it OK to kill it completely?

Thanks
Jianchao
On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
> Hi Ming
>
> On 07/31/2018 12:58 PM, Ming Lei wrote:
>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>> Currently, we will always set SCHED_RESTART whenever there are
>>> requests in hctx->dispatch, then when request is completed and
>>> freed the hctx queues will be restarted to avoid IO hang. This
>>> is unnecessary most of time. Especially when there are lots of
>>> LUNs attached to one host, the RR restart loop could be very
>>> expensive.
>>
>> The big RR restart loop has been killed in the following commit:
>>
>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>> Author: Ming Lei <ming.lei@redhat.com>
>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>
>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>
> Oh, sorry, I didn't look into this patch due to its title when iterated the
> mail list, therefore I didn't realize the RR restart loop has already been
> killed. :)
>
> The RR restart loop could ensure the fairness of sharing some LLDD resource,
> not just avoid IO hung. Is it OK to kill it totally ?

Yeah, it is. Also, the fairness might be improved a bit by the approach of
commit 97889f9ac24f8d2fc, especially inside the driver tag allocation
algorithm.

Thanks,
Ming
Hi Ming

Thanks for your kind response.

On 07/31/2018 02:16 PM, Ming Lei wrote:
> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>> requests in hctx->dispatch, then when request is completed and
>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>> is unnecessary most of time. Especially when there are lots of
>>>> LUNs attached to one host, the RR restart loop could be very
>>>> expensive.
>>>
>>> The big RR restart loop has been killed in the following commit:
>>>
>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>> Author: Ming Lei <ming.lei@redhat.com>
>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>
>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>
>> Oh, sorry, I didn't look into this patch due to its title when iterated the
>> mail list, therefore I didn't realize the RR restart loop has already been
>> killed. :)
>>
>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>> not just avoid IO hung. Is it OK to kill it totally ?
>
> Yeah, it is, also the fairness might be improved a bit by the way in
> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
> algorithem.

Would you mind detailing that a bit more?

Regarding the driver tag case, for example:

  q_a      q_b      q_c      q_d
  hctx0    hctx0    hctx0    hctx0

              tags

The total number of tags is 32, and all 4 of these request queues are
active, so every queue gets 8 tags.

If all 4 queues have used up their 8 tags, they have to wait.

When some of q_a's in-flight requests complete, tags are freed, but
__sbq_wake_up doesn't necessarily wake up q_a; it may wake up q_b instead.
However, due to the limits in hctx_may_queue, q_b still cannot get the
tags, and the RR restart will not wake up q_a either. This is unfair
for q_a.

When we remove the RR restart fashion, at least q_a will be woken up by
the hctx restart. Is this the improvement in fairness you mean in driver
tag allocation?

Thinking further, it seems this only works for the case with an IO
scheduler. Without an IO scheduler, tasks wait in blk_mq_get_request, and
restarting the hctx will not help there.

Thanks
Jianchao
On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
> Hi Ming
>
> Thanks for your kindly response.
>
> On 07/31/2018 02:16 PM, Ming Lei wrote:
>> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>>> Hi Ming
>>>
>>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>>> requests in hctx->dispatch, then when request is completed and
>>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>>> is unnecessary most of time. Especially when there are lots of
>>>>> LUNs attached to one host, the RR restart loop could be very
>>>>> expensive.
>>>>
>>>> The big RR restart loop has been killed in the following commit:
>>>>
>>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>>> Author: Ming Lei <ming.lei@redhat.com>
>>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>>
>>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>>
>>> Oh, sorry, I didn't look into this patch due to its title when iterated the
>>> mail list, therefore I didn't realize the RR restart loop has already been
>>> killed. :)
>>>
>>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>>> not just avoid IO hung. Is it OK to kill it totally ?
>>
>> Yeah, it is, also the fairness might be improved a bit by the way in
>> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
>> algorithem.
>
> Would you mind to detail more here ?
>
> Regarding the driver tag case:
> For example:
>
>   q_a      q_b      q_c      q_d
>   hctx0    hctx0    hctx0    hctx0
>
>               tags
>
> Total number of tags is 32
> All of these 4 q are active.
>
> So every q has 8 tags.
>
> If all of these 4 q have used up their 8 tags, they have to wait.
>
> When part of the in-flight requests of q_a are completed, tags are freed,
> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.

1) in the case of an IO scheduler:
q_a should be woken up, because q_a->hctx0 is added to one wq of the tags
if no tag is available; see blk_mq_mark_tag_wait().

2) in the case of the none scheduler:
q_a should be woken up too; see blk_mq_get_tag().

So I don't understand why you mentioned that q_a can't be woken up.

> However, due to the limits in hctx_may_queue, q_b still cannot get the
> tags. The RR restart also will not wake up q_a.
> This is unfair for q_a.
>
> When we remove RR restart fashion, at least, the q_a will be waked up by
> the hctx restart.
> Is this the improvement of fairness you said in driver tag allocation ?

I mean the fairness is now totally covered by the general tag allocation
algorithm, which is sort of FIFO style because of the waitqueue, whereas
the RR restart woke up queues in the order of the request queue list.

> Think further, it seems that it only works for case with io scheduler.
> w/o io scheduler, tasks will wait in blk_mq_get_request. restart hctx will
> not work there.

When one tag is freed, the sbitmap queue will be woken up, and then some
of the allocations may be satisfied; this works for both an IO scheduler
and none.

Thanks,
Ming
Hi Ming

On 08/01/2018 04:58 PM, Ming Lei wrote:
> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your kindly response.
>>
>> On 07/31/2018 02:16 PM, Ming Lei wrote:
>>> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>>>> requests in hctx->dispatch, then when request is completed and
>>>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>>>> is unnecessary most of time. Especially when there are lots of
>>>>>> LUNs attached to one host, the RR restart loop could be very
>>>>>> expensive.
>>>>>
>>>>> The big RR restart loop has been killed in the following commit:
>>>>>
>>>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>>>> Author: Ming Lei <ming.lei@redhat.com>
>>>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>>>
>>>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>>>
>>>> Oh, sorry, I didn't look into this patch due to its title when iterated the
>>>> mail list, therefore I didn't realize the RR restart loop has already been
>>>> killed. :)
>>>>
>>>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>>>> not just avoid IO hung. Is it OK to kill it totally ?
>>>
>>> Yeah, it is, also the fairness might be improved a bit by the way in
>>> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
>>> algorithem.
>>
>> Would you mind to detail more here ?
>>
>> Regarding the driver tag case:
>> For example:
>>
>>   q_a      q_b      q_c      q_d
>>   hctx0    hctx0    hctx0    hctx0
>>
>>               tags
>>
>> Total number of tags is 32
>> All of these 4 q are active.
>>
>> So every q has 8 tags.
>>
>> If all of these 4 q have used up their 8 tags, they have to wait.
>>
>> When part of the in-flight requests q_a are completed, tags are freed.
>> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.
>
> 1) in case of IO scheduler
> q_a should be waken up because q_a->hctx0 is added to one wq of the tags if
> no tag is available, see blk_mq_mark_tag_wait().
>
> 2) in case of none scheduler
> q_a should be waken up too, see blk_mq_get_tag().
>
> So I don't understand why you mentioned that q_a can't be waken up.

There are multiple sbq_wait_states in one sbitmap_queue, and __sbq_wake_up
will only wake up the waiters on one of them at a time. Please refer to
__sbq_wake_up.

>> However, due to the limits in hctx_may_queue, q_b still cannot get the
>> tags. The RR restart also will not wake up q_a.
>> This is unfair for q_a.
>>
>> When we remove RR restart fashion, at least, the q_a will be waked up by
>> the hctx restart.
>> Is this the improvement of fairness you said in driver tag allocation ?
>
> I mean the fairness is totally covered by the general tag allocation
> algorithm now, which is sort of FIFO style because of waitqueue, but RR
> restart wakes up queue in the order of request queue.

Yes, I got your point.

>> Think further, it seems that it only works for case with io scheduler.
>> w/o io scheduler, tasks will wait in blk_mq_get_request. restart hctx will
>> not work there.
>
> When one tag is freed, the sbitmap queue will be waken up, then some of
> allocation may be satisfied, this way works for both IO sched and none.
>
> Thanks,
> Ming
On Wed, Aug 01, 2018 at 09:37:08PM +0800, jianchao.wang wrote:
> Hi Ming
>
> On 08/01/2018 04:58 PM, Ming Lei wrote:
>> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
>>> Hi Ming
>>>
>>> Thanks for your kindly response.
>>>
>>> On 07/31/2018 02:16 PM, Ming Lei wrote:
>>>> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>>>>> Hi Ming
>>>>>
>>>>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>>>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>>>>> requests in hctx->dispatch, then when request is completed and
>>>>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>>>>> is unnecessary most of time. Especially when there are lots of
>>>>>>> LUNs attached to one host, the RR restart loop could be very
>>>>>>> expensive.
>>>>>>
>>>>>> The big RR restart loop has been killed in the following commit:
>>>>>>
>>>>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>>>>> Author: Ming Lei <ming.lei@redhat.com>
>>>>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>>>>
>>>>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>>>>
>>>>> Oh, sorry, I didn't look into this patch due to its title when iterated the
>>>>> mail list, therefore I didn't realize the RR restart loop has already been
>>>>> killed. :)
>>>>>
>>>>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>>>>> not just avoid IO hung. Is it OK to kill it totally ?
>>>>
>>>> Yeah, it is, also the fairness might be improved a bit by the way in
>>>> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
>>>> algorithem.
>>>
>>> Would you mind to detail more here ?
>>>
>>> Regarding the driver tag case:
>>> For example:
>>>
>>>   q_a      q_b      q_c      q_d
>>>   hctx0    hctx0    hctx0    hctx0
>>>
>>>               tags
>>>
>>> Total number of tags is 32
>>> All of these 4 q are active.
>>>
>>> So every q has 8 tags.
>>>
>>> If all of these 4 q have used up their 8 tags, they have to wait.
>>>
>>> When part of the in-flight requests q_a are completed, tags are freed.
>>> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.
>>
>> 1) in case of IO scheduler
>> q_a should be waken up because q_a->hctx0 is added to one wq of the tags if
>> no tag is available, see blk_mq_mark_tag_wait().
>>
>> 2) in case of none scheduler
>> q_a should be waken up too, see blk_mq_get_tag().
>>
>> So I don't understand why you mentioned that q_a can't be waken up.
>
> There are multiple sbq_wait_states in one sbitmap_queue and __sbq_wake_up
> will only wake up the waiters on one of them one time. Please refer to
> __sbq_wake_up.

Yes, the multiple wqs are woken up in RR style, which is still fair
generally speaking. And there is no such issue of 'q_a' always failing to
be woken up when a request is completed on this queue, is there?

Thanks,
Ming
On Wed, 2018-08-01 at 16:58 +0800, Ming Lei wrote:
> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
>> However, due to the limits in hctx_may_queue, q_b still cannot get the
>> tags. The RR restart also will not wake up q_a.
>> This is unfair for q_a.
>>
>> When we remove RR restart fashion, at least, the q_a will be waked up by
>> the hctx restart.
>> Is this the improvement of fairness you said in driver tag allocation ?
>
> I mean the fairness is totally covered by the general tag allocation
> algorithm now, which is sort of FIFO style because of waitqueue, but RR
> restart wakes up queue in the order of request queue.

From sbitmap.h:

#define SBQ_WAIT_QUEUES 8

What do you think is the effect of your patch if more than eight LUNs are
active and the SCSI queue is full?

Thanks,

Bart.
On Thu, Aug 02, 2018 at 03:52:00PM +0000, Bart Van Assche wrote:
> On Wed, 2018-08-01 at 16:58 +0800, Ming Lei wrote:
>> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
>>> However, due to the limits in hctx_may_queue, q_b still cannot get the
>>> tags. The RR restart also will not wake up q_a.
>>> This is unfair for q_a.
>>>
>>> When we remove RR restart fashion, at least, the q_a will be waked up by
>>> the hctx restart.
>>> Is this the improvement of fairness you said in driver tag allocation ?
>>
>> I mean the fairness is totally covered by the general tag allocation
>> algorithm now, which is sort of FIFO style because of waitqueue, but RR
>> restart wakes up queue in the order of request queue.
>
> From sbitmap.h:
>
> #define SBQ_WAIT_QUEUES 8
>
> What do you think is the effect of your patch if more than eight LUNs are
> active and the SCSI queue is full?

Jens introduced the multiple wait queues to scatter the atomic operations
across multiple counters; in concept the design is easy to understand if
you just treat them as one whole queue.

And about the situation you mentioned, there is nothing special versus the
normal cases or thousands of LUNs: just a batch of queues is woken up from
one single wait queue (sbq_wait_state), and inside each wait queue, queues
are handled in FIFO order.

Or what is your expected ideal behaviour regarding fairness?

thanks,
Ming
On Fri, 2018-08-03 at 00:58 +0800, Ming Lei wrote:
> And about the situations you mentioned, no any special as normal cases
> or thousands of LUNs. Just a batch of queues are waken up from one
> single wait queue(sbq_wait_state), and inside each wait queue, queues
> are handled actually in FIFO order.
>
> Or what is your expected ideal behaviour about fairness?

Hello Ming,

What I expect is that if the number of LUNs is really large, all LUNs are
treated equally. Unless someone can set up a test that demonstrates that
this is still the case for commit 97889f9ac24f ("blk-mq: remove
synchronize_rcu() from blk_mq_del_queue_tag_set()"), I will assume that
that commit breaks fairness.

Bart.
On Thu, Aug 02, 2018 at 05:08:55PM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 00:58 +0800, Ming Lei wrote:
>> And about the situations you mentioned, no any special as normal cases
>> or thousands of LUNs. Just a batch of queues are waken up from one
>> single wait queue(sbq_wait_state), and inside each wait queue, queues
>> are handled actually in FIFO order.
>>
>> Or what is your expected ideal behaviour about fairness?
>
> Hello Ming,
>
> What I expect is if the number of LUNs is really large that all LUNs are treated
> equally. Unless someone can set up a test that demonstrates that this is still

Idle LUNs shouldn't be treated the same as LUNs which actually have
requests to serve. Also, as I mentioned, the original RR style isn't
better than the new way, since queues are now handled in a sort of FIFO
style thanks to the wait queue.

Not to mention that a lot of CPU time is consumed unnecessarily iterating
over all queues even when there is only one active queue; is that fair
from the system's point of view?

Thanks,
Ming
On Fri, 2018-08-03 at 01:17 +0800, Ming Lei wrote:
> Not mentioning big CPU utilization is consumed unnecessarily for iterating
> over all queues even though there is only one active queue, is this fair from
> system view?

I hope that someone will come up some day with a better solution than the
list_for_each_entry_rcu_rr() loop. But as long as such a solution has not
yet been found, that loop should stay in order to preserve fair handling
of large numbers of LUNs.

Bart.
On Thu, Aug 02, 2018 at 05:24:15PM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 01:17 +0800, Ming Lei wrote:
>> Not mentioning big CPU utilization is consumed unnecessarily for iterating
>> over all queues even though there is only one active queue, is this fair from
>> system view?
>
> I hope that someone will come up some day with a better solution than the
> list_for_each_entry_rcu_rr() loop. But as long as such a solution has not yet
> been found that loop should stay in order to preserve fair handling of large
> numbers of LUNs.

Again, list_for_each_entry_rcu_rr() isn't fair when only a small number of
queues is active, because it decreases IOPS a lot and consumes much CPU
just to check thousands of queues unnecessarily, even though 99% of them
are idle.

Thanks,
Ming
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c..3270ad0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -54,7 +54,7 @@ void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio)
  * Mark a hardware queue as needing a restart. For shared queues, maintain
  * a count of how many hardware queues are marked for restart.
  */
-static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		return;
@@ -202,15 +202,13 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * we only ever dispatch a fraction of the requests available because
 	 * of low device queue depth. Once we pull requests out of the IO
 	 * scheduler, we can no longer merge or sort them. So it's best to
-	 * leave them there for as long as we can. Mark the hw queue as
-	 * needing a restart in that case.
+	 * leave them there for as long as we can.
 	 *
 	 * We want to dispatch from the scheduler if there was nothing
 	 * on the dispatch list or we were able to dispatch from the
 	 * dispatch list.
 	 */
 	if (!list_empty(&rq_list)) {
-		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
 				blk_mq_do_dispatch_sched(hctx);
@@ -417,7 +415,6 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 	 */
 	if (!atomic_read(&queue->shared_hctx_restart))
 		return;
-
 	rcu_read_lock();
 	list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
				   tag_set_list) {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f93..650cbfc 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,6 +32,7 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			    unsigned int hctx_idx);
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);

 static inline bool
 blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45df734..b47627a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1086,6 +1086,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool marked = false;

 	if (list_empty(list))
 		return false;
@@ -1096,6 +1097,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
+retry:
 	do {
 		struct blk_mq_queue_data bd;
@@ -1115,12 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 */
 			if (!blk_mq_mark_tag_wait(&hctx, rq)) {
 				blk_mq_put_dispatch_budget(hctx);
-				/*
-				 * For non-shared tags, the RESTART check
-				 * will suffice.
-				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
-					no_tag = true;
+				no_tag = true;
 				break;
 			}
 		}
@@ -1172,42 +1169,61 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
-		bool needs_restart;
+		bool run_queue = false;
+
+		if (!no_tag && (ret != BLK_STS_RESOURCE) && !marked) {
+			blk_mq_sched_mark_restart_hctx(hctx);
+			marked = true;
+			/*
+			 * .queue_rq will put the budget, so we need to get it again.
+			 */
+			got_budget = false;
+			goto retry;
+		}

 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);

 		/*
-		 * If SCHED_RESTART was set by the caller of this function and
-		 * it is no longer set that means that it was cleared by another
-		 * thread and hence that a queue rerun is needed.
+		 * The reason for why we reach here could be:
+		 * - No driver tag
+		 *   For the shared-tags case, the tag wakeup hook will work
+		 *   and hctx_may_queue will avoid starvation of other ones.
+		 *   For the non-shared-tags case, hctx restart is employed.
 		 *
-		 * If 'no_tag' is set, that means that we failed getting
-		 * a driver tag with an I/O scheduler attached. If our dispatch
-		 * waitqueue is no longer active, ensure that we run the queue
-		 * AFTER adding our entries back to the list.
+		 *   Because reqs have not been queued to hctx->dispatch list
+		 *   when blk_mq_mark_tag_wait, both of two conditions will be
+		 *   triggered w/ an empty hctx->dispatch and do nothing finally.
+		 *   So recheck them here and rerun the queue if needed.
 		 *
-		 * If no I/O scheduler has been configured it is possible that
-		 * the hardware queue got stopped and restarted before requests
-		 * were pushed back onto the dispatch list. Rerun the queue to
-		 * avoid starvation. Notes:
-		 * - blk_mq_run_hw_queue() checks whether or not a queue has
-		 *   been stopped before rerunning a queue.
-		 * - Some but not all block drivers stop a queue before
-		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
-		 *   and dm-rq.
-		 *
-		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
-		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 * - BLK_STS_DEV_RESOURCE/no budget
+		 *   Employ the hctx restart mechanism to restart the queue.
+		 *   The LLDD resources could be released:
+		 *   - before mark SCHED_RESTART. We retry to dispatch to fix
+		 *     this case.
+		 *   - after mark SCHED_RESTART before queue requests on
+		 *     hctx->dispatch. we recheck the SCHED_RESTART, if not there,
+		 *     rerun the hctx.
+		 * - BLK_STS_RESOURCE
+		 *   Run queue after a delay to avoid IO stalls.
+		 *   More details please refer to comment of BLK_STS_DEV_RESOURCE.
 		 */
-		needs_restart = blk_mq_sched_needs_restart(hctx);
-		if (!needs_restart ||
-		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
-			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
-			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+
+		if (no_tag) {
+			if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				run_queue = list_empty_careful(&hctx->dispatch_wait.entry);
+			else
+				run_queue = !blk_mq_sched_needs_restart(hctx);
+			if (run_queue)
+				blk_mq_run_hw_queue(hctx, true);
+		} else if (ret == BLK_STS_RESOURCE) {
+			if (blk_mq_sched_needs_restart(hctx))
+				blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+		} else {
+			if (!blk_mq_sched_needs_restart(hctx))
+				blk_mq_run_hw_queue(hctx, true);
+		}

 		return false;
 	}
Currently, we will always set SCHED_RESTART whenever there are requests in
hctx->dispatch; then, when a request is completed and freed, the hctx
queues will be restarted to avoid IO hangs. This is unnecessary most of
the time. Especially when there are lots of LUNs attached to one host, the
RR restart loop could be very expensive.

Requests will be queued on the hctx dispatch list due to:
 - flush and cpu hotplug cases
 - no driver tag w/ io scheduler
 - LLDD returns BLK_STS_RESOURCE/BLK_STS_DEV_RESOURCE
 - no driver budget

We needn't care about the 1st case; it is a normal insert. Regarding the
2nd case, if tags are shared, the sbitmap_queue wakeup hook will work and
hctx_may_queue will avoid starvation of other ones; otherwise, hctx
restart is employed, but it is done by blk_mq_mark_tag_wait. We need hctx
restart for the 3rd and 4th cases to rerun the hctx queue when some LLDD
limits are lifted because in-flight requests have completed.

So we remove blk_mq_sched_mark_restart_hctx from
blk_mq_sched_dispatch_requests and only use it when the LLDD returns
BLK_STS_DEV_RESOURCE or there is no driver budget.

In addition, there are some race conditions fixed in this patch. For more
details please refer to the comment in blk_mq_dispatch_rq_list.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
---
 block/blk-mq-sched.c |  7 ++---
 block/blk-mq-sched.h |  1 +
 block/blk-mq.c       | 82 +++++++++++++++++++++++++++++++---------------------
 3 files changed, 52 insertions(+), 38 deletions(-)