From patchwork Mon Oct 9 11:24:24 2017
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 9992651
From: Ming Lei
To: Jens Axboe, linux-block@vger.kernel.org, Christoph Hellwig, Mike Snitzer, dm-devel@redhat.com
Cc: Bart Van Assche, Laurence Oberman, Paolo Valente, Oleksandr Natalenko, Tom Nguyen, linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, Omar Sandoval, Ming Lei
Subject: [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
Date: Mon, 9 Oct 2017 19:24:24 +0800
Message-Id: <20171009112424.30524-6-ming.lei@redhat.com>
In-Reply-To: <20171009112424.30524-1-ming.lei@redhat.com>
References: <20171009112424.30524-1-ming.lei@redhat.com>

During dispatch, we move all requests from hctx->dispatch to a temporary
list and then dispatch them one by one from that list. Unfortunately,
during this window a queue run from another context may see an empty
->dispatch list, conclude the queue is idle, and start dequeuing from
the sw/scheduler queues and dispatching. This hurts sequential I/O
performance because requests are dequeued while the LLD (low-level
driver) queue is still busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to make sure
that no request is dequeued until ->dispatch is fully flushed.
Reviewed-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Reviewed-by: Omar Sandoval
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 38 ++++++++++++++++++++++++++++++++------
 block/blk-mq.c         |  5 +++++
 include/linux/blk-mq.h |  1 +
 4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e5dccc9f6f1d..6c15487bc3ff 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -181,6 +181,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(SCHED_RESTART),
 	HCTX_STATE_NAME(TAG_WAITING),
 	HCTX_STATE_NAME(START_ON_RUN),
+	HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 14b354f617e5..9f549711da84 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -95,6 +95,18 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);

+	/*
+	 * If DISPATCH_BUSY is set, that means hw queue is busy
+	 * and requests in the list of hctx->dispatch need to
+	 * be flushed first, so return early.
+	 *
+	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+	 * will be run to try to make progress, so it is always
+	 * safe to check the state here.
+	 */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
 	do {
 		struct request *rq = e->type->ops.mq.dispatch_request(hctx);

@@ -121,6 +133,10 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);

+	/* See same comment in blk_mq_do_dispatch_sched() */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
 	do {
 		struct request *rq;

@@ -176,12 +192,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list)) {
-			if (has_sched_dispatch)
-				blk_mq_do_dispatch_sched(hctx);
-			else
-				blk_mq_do_dispatch_ctx(hctx);
-		}
+		blk_mq_dispatch_rq_list(q, &rq_list);
+
+		/*
+		 * We may clear DISPATCH_BUSY just after it is set from
+		 * another context, the only cost is that one request is
+		 * dequeued a bit early, we can survive that. Given the
+		 * window is small enough, no need to worry about performance
+		 * effect.
+		 */
+		if (list_empty_careful(&hctx->dispatch))
+			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+
+		if (has_sched_dispatch)
+			blk_mq_do_dispatch_sched(hctx);
+		else
+			blk_mq_do_dispatch_ctx(hctx);
 	} else if (has_sched_dispatch) {
 		blk_mq_do_dispatch_sched(hctx);
 	} else if (q->queue_depth) {

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 394cb75d66fa..06dda6182b7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1172,6 +1172,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		/*
+		 * DISPATCH_BUSY won't be cleared until all requests
+		 * in hctx->dispatch are dispatched successfully
+		 */
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 		spin_unlock(&hctx->lock);

 		/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b7a366a97f3..13f6c25fa461 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
 	BLK_MQ_S_SCHED_RESTART	= 2,
 	BLK_MQ_S_TAG_WAITING	= 3,
 	BLK_MQ_S_START_ON_RUN	= 4,
+	BLK_MQ_S_DISPATCH_BUSY	= 5,

 	BLK_MQ_MAX_DEPTH	= 10240,