From patchwork Thu Oct 12 18:37:03 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ming Lei X-Patchwork-Id: 10002515 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 1199360216 for ; Thu, 12 Oct 2017 18:39:23 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 06ED520602 for ; Thu, 12 Oct 2017 18:39:23 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EFE24283D1; Thu, 12 Oct 2017 18:39:22 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 79C02267EC for ; Thu, 12 Oct 2017 18:39:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755002AbdJLSjH (ORCPT ); Thu, 12 Oct 2017 14:39:07 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43496 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753023AbdJLSjG (ORCPT ); Thu, 12 Oct 2017 14:39:06 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id F1C9CC0467DC; Thu, 12 Oct 2017 18:39:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com F1C9CC0467DC Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=ming.lei@redhat.com Received: from localhost (ovpn-12-77.pek2.redhat.com [10.72.12.77]) by smtp.corp.redhat.com (Postfix) with ESMTP id 5C8FE7838B; Thu, 12 Oct 2017 18:38:51 +0000 (UTC) From: Ming Lei To: Jens Axboe , linux-block@vger.kernel.org, Christoph Hellwig Cc: Bart Van Assche , Laurence Oberman , Paolo Valente , Oleksandr Natalenko , Tom Nguyen , linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, Omar Sandoval , John Garry , Ming Lei Subject: [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue Date: Fri, 13 Oct 2017 02:37:03 +0800 Message-Id: <20171012183704.22326-6-ming.lei@redhat.com> In-Reply-To: <20171012183704.22326-1-ming.lei@redhat.com> References: <20171012183704.22326-1-ming.lei@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.31]); Thu, 12 Oct 2017 18:39:06 +0000 (UTC) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP SCSI devices use host-wide tagset, and the shared driver tag space is often quite big. Meantime there is also queue depth for each lun( .cmd_per_lun), which is often small, for example, on both lpfc and qla2xxx, .cmd_per_lun is just 3. So lots of requests may stay in sw queue, and we always flush all belonging to same hw queue and dispatch them all to driver, unfortunately it is easy to cause queue busy because of the small .cmd_per_lun. Once these requests are flushed out, they have to stay in hctx->dispatch, and no bio merge can participate into these requests, and sequential IO performance is hurt a lot. This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from sw queue so that we can dispatch them in scheduler's way, then we can avoid to dequeue too many requests from sw queue when ->dispatch isn't flushed completely. This patch improves dispatching from sw queue by using the callback of .get_budget and .put_budget Reviewed-by: Omar Sandoval Reviewed-by: Bart Van Assche Signed-off-by: Ming Lei --- Jens/Omar, once .get_budget is introduced, it can be unusual to return busy from .queue_rq, so we can't use the approach discussed to decide to take one by one or flush all. Then this patch simply checks if .get_budget is implemented, looks it is reasonable, and in future it is even possible to take more requests by getting enough budget first. block/blk-mq-sched.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++--- block/blk-mq.c | 39 +++++++++++++++++++++++++++++++ block/blk-mq.h | 2 ++ include/linux/blk-mq.h | 2 ++ 4 files changed, 103 insertions(+), 3 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index bcf4773ee1ca..ccbbc7e108ea 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -117,6 +117,50 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) return false; } +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *ctx) +{ + unsigned idx = ctx->index_hw; + + if (++idx == hctx->nr_ctx) + idx = 0; + + return hctx->ctxs[idx]; +} + +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + LIST_HEAD(rq_list); + struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from); + + do { + struct request *rq; + + if (!sbitmap_any_bit_set(&hctx->ctx_map)) + break; + + if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) + return true; + + rq = blk_mq_dequeue_from_ctx(hctx, ctx); + if (!rq) { + if (q->mq_ops->put_budget) + q->mq_ops->put_budget(hctx); + break; + } + list_add(&rq->queuelist, &rq_list); + + /* round robin for fair dispatch */ + ctx = blk_mq_next_ctx(hctx, rq->mq_ctx); + + } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); + + WRITE_ONCE(hctx->dispatch_from, ctx); + + return false; +} + void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; @@ -157,11 +201,24 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) */ if (!list_empty(&rq_list)) { blk_mq_sched_mark_restart_hctx(hctx); - if (blk_mq_dispatch_rq_list(q, &rq_list, false) && - has_sched_dispatch) - run_queue = blk_mq_do_dispatch_sched(hctx); + if (blk_mq_dispatch_rq_list(q, &rq_list, false)) { + if (has_sched_dispatch) + run_queue = blk_mq_do_dispatch_sched(hctx); + else + run_queue = blk_mq_do_dispatch_ctx(hctx); + } } else if (has_sched_dispatch) { run_queue = blk_mq_do_dispatch_sched(hctx); + } else if (q->mq_ops->get_budget) { + /* + * If we need to get budget before queuing request, we + * dequeue request one by one from sw queue for avoiding + * to mess up I/O merge when dispatch runs out of resource. + * + * TODO: get more budgets, and dequeue more requests in + * one time. + */ + run_queue = blk_mq_do_dispatch_ctx(hctx); } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); blk_mq_dispatch_rq_list(q, &rq_list, false); diff --git a/block/blk-mq.c b/block/blk-mq.c index e6e50a91f170..9e2fc26a9018 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list) } EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs); +struct dispatch_rq_data { + struct blk_mq_hw_ctx *hctx; + struct request *rq; +}; + +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, + void *data) +{ + struct dispatch_rq_data *dispatch_data = data; + struct blk_mq_hw_ctx *hctx = dispatch_data->hctx; + struct blk_mq_ctx *ctx = hctx->ctxs[bitnr]; + + spin_lock(&ctx->lock); + if (unlikely(!list_empty(&ctx->rq_list))) { + dispatch_data->rq = list_entry_rq(ctx->rq_list.next); + list_del_init(&dispatch_data->rq->queuelist); + if (list_empty(&ctx->rq_list)) + sbitmap_clear_bit(sb, bitnr); + } + spin_unlock(&ctx->lock); + + return !dispatch_data->rq; +} + +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *start) +{ + unsigned off = start ? start->index_hw : 0; + struct dispatch_rq_data data = { + .hctx = hctx, + .rq = NULL, + }; + + __sbitmap_for_each_set(&hctx->ctx_map, off, + dispatch_rq_from_ctx, &data); + + return data.rq; +} + static inline unsigned int queued_to_index(unsigned int queued) { if (!queued) diff --git a/block/blk-mq.h b/block/blk-mq.h index d1bfce3e69ad..4d12ef08b0a9 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list); bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx); bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, bool wait); +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *start); /* * Internal helpers for allocating/freeing the request map diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index f8c4ba2682ec..cc757dcc49e1 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx { struct sbitmap ctx_map; + struct blk_mq_ctx *dispatch_from; + struct blk_mq_ctx **ctxs; unsigned int nr_ctx;