[V5,8/8] blk-mq: improve bio merge from blk-mq sw queue

Message ID	20170930112655.31451-9-ming.lei@redhat.com (mailing list archive)
State	Superseded, archived
Delegated to:	Mike Snitzer
Headers	show Return-Path: <dm-devel-bounces@redhat.com> DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com CB3E913A88 From: Ming Lei <ming.lei@redhat.com> To: Jens Axboe <axboe@fb.com>, linux-block@vger.kernel.org, Christoph Hellwig <hch@infradead.org>, Mike Snitzer <snitzer@redhat.com>, dm-devel@redhat.com Date: Sat, 30 Sep 2017 19:26:55 +0800 Message-Id: <20170930112655.31451-9-ming.lei@redhat.com> In-Reply-To: <20170930112655.31451-1-ming.lei@redhat.com> References: <20170930112655.31451-1-ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com>, Paolo Valente <paolo.valente@linaro.org>, Laurence Oberman <loberman@redhat.com>, Oleksandr Natalenko <oleksandr@natalenko.name>, linux-kernel@vger.kernel.org, Bart Van Assche <bart.vanassche@sandisk.com>, Ming Lei <ming.lei@redhat.com>, Tom Nguyen <tom81094@gmail.com> Subject: [dm-devel] [PATCH V5 8/8] blk-mq: improve bio merge from blk-mq sw queue Precedence: junk MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com

Message ID

20170930112655.31451-9-ming.lei@redhat.com (mailing list archive)

State

Superseded, archived

Delegated to:

Mike Snitzer

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com CB3E913A88
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@fb.com>, linux-block@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Mike Snitzer <snitzer@redhat.com>, dm-devel@redhat.com
Date: Sat, 30 Sep 2017 19:26:55 +0800
Message-Id: <20170930112655.31451-9-ming.lei@redhat.com>
In-Reply-To: <20170930112655.31451-1-ming.lei@redhat.com>
References: <20170930112655.31451-1-ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>, Paolo Valente <paolo.valente@linaro.org>,
	Laurence Oberman <loberman@redhat.com>,
	Oleksandr Natalenko <oleksandr@natalenko.name>,
	linux-kernel@vger.kernel.org,
	Bart Van Assche <bart.vanassche@sandisk.com>, 
	Ming Lei <ming.lei@redhat.com>, Tom Nguyen <tom81094@gmail.com>
Subject: [dm-devel] [PATCH V5 8/8] blk-mq: improve bio merge from blk-mq sw
	queue
Precedence: junk
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com

Commit Message

Ming Lei Sept. 30, 2017, 11:26 a.m. UTC

This patch uses hash table to do bio merge from sw queue,
then we can align to blk-mq scheduler/block legacy's way
for bio merge.

Turns out bio merge via hash table is more efficient than
simple merge on the last 8 requests in sw queue. On SCSI SRP,
it is observed ~10% IOPS is increased in sequential IO test
with this patch.

It is also one step forward to real 'none' scheduler, in which
way the blk-mq scheduler framework can be more clean.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Tom Nguyen <tom81094@gmail.com>
Tested-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 49 ++++++++++++-------------------------------------
 block/blk-mq.c       | 28 +++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 40 deletions(-)

Comments

Christoph Hellwig Oct. 3, 2017, 9:21 a.m. UTC | #1

This looks generally good to me, but I really worry about the impact
on very high iops devices.  Did you try this e.g. for random reads
from unallocated blocks on an enterprise NVMe SSD?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Ming Lei Oct. 9, 2017, 4:28 a.m. UTC | #2

On Tue, Oct 03, 2017 at 02:21:43AM -0700, Christoph Hellwig wrote:
> This looks generally good to me, but I really worry about the impact
> on very high iops devices.  Did you try this e.g. for random reads
> from unallocated blocks on an enterprise NVMe SSD?

Looks no such impact, please see the following data
in the fio test(libaio, direct, bs=4k, 64jobs, randread, none scheduler)

[root@storageqe-62 results]# ../parse_fio 4.14.0-rc2.no_blk_mq_perf+-nvme-64jobs-mq-none.log 4.14.0-rc2.BLK_MQ_PERF_V5+-nvme-64jobs-mq-none.log
---------------------------------------
 IOPS(K)  |    NONE     |    NONE
---------------------------------------
randread  |      650.98 |      653.15
---------------------------------------

OR:

If you worry about this impact, can we simply disable merge on NVMe
for none scheduler? It is basically impossible to merge NVMe's
request/bio when none is used, but it can be doable in case of kyber
scheduler.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a58f4746317c..8262ae71e0cd 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -260,50 +260,25 @@  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
-/*
- * Reverse check our software queue for entries that we could potentially
- * merge with. Currently includes a hand-wavy stop count of 8, to not spend
- * too much time checking for merges.
- */
-static bool blk_mq_attempt_merge(struct request_queue *q,
+static bool blk_mq_ctx_try_merge(struct request_queue *q,
 				 struct blk_mq_ctx *ctx, struct bio *bio)
 {
-	struct request *rq;
-	int checked = 8;
+	struct request *rq, *free = NULL;
+	enum elv_merge type;
+	bool merged;
 
 	lockdep_assert_held(&ctx->lock);
 
-	list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) {
-		bool merged = false;
-
-		if (!checked--)
-			break;
-
-		if (!blk_rq_merge_ok(rq, bio))
-			continue;
+	type = elv_merge_ctx(q, &rq, bio, ctx);
+	merged = __blk_mq_try_merge(q, bio, &free, rq, type);
 
-		switch (blk_try_merge(rq, bio)) {
-		case ELEVATOR_BACK_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_back_merge(q, rq, bio);
-			break;
-		case ELEVATOR_FRONT_MERGE:
-			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_front_merge(q, rq, bio);
-			break;
-		case ELEVATOR_DISCARD_MERGE:
-			merged = bio_attempt_discard_merge(q, rq, bio);
-			break;
-		default:
-			continue;
-		}
+	if (free)
+		blk_mq_free_request(free);
 
-		if (merged)
-			ctx->rq_merged++;
-		return merged;
-	}
+	if (merged)
+		ctx->rq_merged++;
 
-	return false;
+	return merged;
 }
 
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
@@ -321,7 +296,7 @@  bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 	if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
 		/* default per sw-queue merge */
 		spin_lock(&ctx->lock);
-		ret = blk_mq_attempt_merge(q, ctx, bio);
+		ret = blk_mq_ctx_try_merge(q, ctx, bio);
 		spin_unlock(&ctx->lock);
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 561a663cdd0e..9a3a561a63b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -849,6 +849,18 @@  static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static void blk_mq_ctx_remove_rq_list(struct blk_mq_ctx *ctx,
+		struct list_head *head)
+{
+	struct request *rq;
+
+	lockdep_assert_held(&ctx->lock);
+
+	list_for_each_entry(rq, head, queuelist)
+		rqhash_del(rq);
+	ctx->last_merge = NULL;
+}
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
@@ -863,6 +875,7 @@  static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
 	sbitmap_clear_bit(sb, bitnr);
 	spin_lock(&ctx->lock);
 	list_splice_tail_init(&ctx->rq_list, flush_data->list);
+	blk_mq_ctx_remove_rq_list(ctx, flush_data->list);
 	spin_unlock(&ctx->lock);
 	return true;
 }
@@ -892,17 +905,23 @@  static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void *d
 	struct dispatch_rq_data *dispatch_data = data;
 	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
 	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+	struct request *rq = NULL;
 
 	spin_lock(&ctx->lock);
 	if (unlikely(!list_empty(&ctx->rq_list))) {
-		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
-		list_del_init(&dispatch_data->rq->queuelist);
+		rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&rq->queuelist);
+		rqhash_del(rq);
 		if (list_empty(&ctx->rq_list))
 			sbitmap_clear_bit(sb, bitnr);
 	}
+	if (ctx->last_merge == rq)
+		ctx->last_merge = NULL;
 	spin_unlock(&ctx->lock);
 
-	return !dispatch_data->rq;
+	dispatch_data->rq = rq;
+
+	return !rq;
 }
 
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
@@ -1433,6 +1452,8 @@  static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
 		list_add_tail(&rq->queuelist, &ctx->rq_list);
+
+	rqhash_add(ctx->hash, rq);
 }
 
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
@@ -1968,6 +1989,7 @@  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		blk_mq_ctx_remove_rq_list(ctx, &tmp);
 		blk_mq_hctx_clear_pending(hctx, ctx);
 	}
 	spin_unlock(&ctx->lock);

[V5,8/8] blk-mq: improve bio merge from blk-mq sw queue

Commit Message

Comments

Patch