From patchwork Fri Oct 13 16:28:00 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jens Axboe <axboe@kernel.dk>
X-Patchwork-Id: 10005263
Subject: Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in
	blk_mq_ops
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org, Christoph Hellwig <hch@infradead.org>,
	Bart Van Assche <bart.vanassche@sandisk.com>,
	Laurence Oberman <loberman@redhat.com>,
	Paolo Valente <paolo.valente@linaro.org>,
	Oleksandr Natalenko <oleksandr@natalenko.name>,
	Tom Nguyen <tom81094@gmail.com>, linux-kernel@vger.kernel.org,
	linux-scsi@vger.kernel.org, Omar Sandoval <osandov@fb.com>,
	John Garry <john.garry@huawei.com>
References: <20171012183704.22326-1-ming.lei@redhat.com>
	<20171012183704.22326-5-ming.lei@redhat.com>
	<9a741c03-90a3-e583-ddde-0ed71c8570a2@kernel.dk>
	<20171013001919.GA24715@ming.t460p>
	<6efdb459-8746-562d-06dc-5b3e172076e1@kernel.dk>
	<20171013160731.GA30899@ming.t460p>
	<845bd050-8566-8749-d73f-9a3731c7736f@kernel.dk>
	<20171013162111.GC30899@ming.t460p>
From: Jens Axboe <axboe@kernel.dk>
Message-ID: <18389d59-4228-e4f5-c6e2-ebeabe17fc37@kernel.dk>
Date: Fri, 13 Oct 2017 10:28:00 -0600
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <20171013162111.GC30899@ming.t460p>
Content-Language: en-US
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On 10/13/2017 10:21 AM, Ming Lei wrote:
> On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote:
>> On 10/13/2017 10:07 AM, Ming Lei wrote:
>>> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:
>>>> On 10/12/2017 06:19 PM, Ming Lei wrote:
>>>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
>>>>>> On 10/12/2017 12:37 PM, Ming Lei wrote:
>>>>>>> For SCSI devices, there is often a per-request-queue depth, which needs
>>>>>>> to be respected before queuing a request.
>>>>>>>
>>>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq()
>>>>>>> to dispatch the request to the LLD. One obvious issue with this approach is
>>>>>>> that I/O merging may suffer: when the per-request-queue depth can't be
>>>>>>> respected, .queue_rq() has to return BLK_STS_RESOURCE, so the request
>>>>>>> has to stay in the hctx->dispatch list and never gets a chance to
>>>>>>> participate in I/O merging.
>>>>>>>
>>>>>>> This patch introduces the .get_budget and .put_budget callbacks in
>>>>>>> blk_mq_ops, so that we can try to get reserved budget before dequeuing a
>>>>>>> request. If we can't get budget for queueing I/O, we don't need to dequeue
>>>>>>> a request at all, which improves I/O merging a lot.
>>>>>>
>>>>>> I can't help but think that it would be cleaner to just be able to
>>>>>> reinsert the request into the scheduler properly, if we fail to
>>>>>> dispatch it. Bart hinted at that earlier as well.
>>>>>
>>>>> Actually, when I started to investigate the issue, the first thing I
>>>>> tried was to reinsert, but that approach is even worse on qla2xxx.
>>>>>
>>>>> Once a request is dequeued, the chance of an IO merge decreases a lot.
>>>>> With the none scheduler, merging becomes impossible because
>>>>> we only try to merge over the last 8 requests. With mq-deadline,
>>>>> when one request is reinserted, another request may be dequeued
>>>>> at the same time.
>>>>
>>>> I don't care too much about 'none'. If perfect merging is crucial for
>>>> getting to the performance level you want on the hardware you are using,
>>>> you should not be using 'none'. 'none' will work perfectly fine for NVMe
>>>> etc style devices, where we are not dependent on merging to the same
>>>> extent that we are on other devices.
>>>>
>>>> mq-deadline reinsertion will be expensive, that's in the nature of that
>>>> beast. It's basically the same as a normal request insertion.  So for
>>>> that, we'd have to be a bit careful not to run into this too much. Even
>>>> with a dumb approach, it should only happen 1 out of N times, where N is
>>>> the typical point at which the device will return STS_RESOURCE. The
>>>> reinsertion vs dequeue should be serialized with your patch to do that,
>>>> at least for the single queue mq-deadline setup. In fact, I think your
>>>> approach suffers from that same basic race, in that the budget isn't a
>>>> hard allocation, it's just a hint. It can change from the time you check
>>>> it, and when you go and dispatch the IO, if you don't serialize that
>>>> part. So it really should be no different in that regard.
>>>
>>> In the case of SCSI, .get_budget is implemented with atomic counting,
>>> and it is completely effective at avoiding unnecessary dequeues; please
>>> take a look at patch 6.
>>
>> Looks like you are right, I had initially misread that as just checking
>> the busy count. But you are actually getting the count at that point,
>> so it should be solid.
>>
>>>>> Not to mention the cost of acquiring/releasing the lock; that is
>>>>> just useless work that wastes CPU.
>>>>
>>>> Sure, my point is that if it doesn't happen too often, it doesn't really
>>>> matter. It's not THAT expensive.
>>>
>>> Actually it is in the hot path; for example, lpfc and qla2xxx's queue depth
>>> is 3, so it is quite easy to trigger STS_RESOURCE.
>>
>> Ugh, that is low.
>>
>> OK, I think we should just roll with this and see how far we can go. I'll
>> apply it for 4.15.
> 
> OK, I have some updates and will post a new version soon.

Fold something like this into it then:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ccbbc7e108ea..b7bf84b5ddf2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -102,13 +102,12 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 				!e->type->ops.mq.has_work(hctx))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = e->type->ops.mq.dispatch_request(hctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);
@@ -140,13 +139,12 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 255a705f8672..008c975b6f4b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1087,14 +1087,6 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (q->mq_ops->put_budget && got_budget)
-		q->mq_ops->put_budget(hctx);
-}
-
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		bool got_budget)
 {
@@ -1125,7 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * rerun the hardware queue when a tag is freed.
 			 */
 			if (!blk_mq_dispatch_wait_add(hctx)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}
 
@@ -1135,16 +1127,13 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * hardware queue to the wait queue.
 			 */
 			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}
 		}
 
-		if (!got_budget) {
-			if (q->mq_ops->get_budget &&
-					!q->mq_ops->get_budget(hctx))
-				break;
-		}
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+			break;
 
 		list_del_init(&rq->queuelist);
 
@@ -1642,7 +1631,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
-	if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) {
+	if (!blk_mq_get_dispatch_budget(hctx)) {
 		blk_mq_put_driver_tag(rq);
 		goto insert;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4d12ef08b0a9..9a1426e8b6e5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -139,4 +139,23 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 			unsigned int inflight[2]);
 
+static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (!q->mq_ops->get_budget)
+		return true;
+
+	return q->mq_ops->get_budget(hctx);
+}
+
+static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx,
+					      bool got_budget)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (got_budget && q->mq_ops->put_budget)
+		q->mq_ops->put_budget(hctx);
+}
+
 #endif
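
For anyone following along, here is a rough user-space model of the
reserve-before-dequeue idea and of the two wrappers above. It is only a
sketch under stated assumptions: struct model_queue, struct model_ops and
all model_* names are hypothetical stand-ins, not kernel structures, and
the increment-then-check counter is just one plausible way to do the
atomic counting mentioned for SCSI, not a copy of patch 6. The point it
tries to show is that the budget is taken atomically (a hard reservation
rather than a hint), and that a missing callback means "always have
budget".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel structures. */
struct model_queue;

struct model_ops {
	bool (*get_budget)(struct model_queue *q);	/* optional, may be NULL */
	void (*put_budget)(struct model_queue *q);	/* optional, may be NULL */
};

struct model_queue {
	const struct model_ops *ops;
	atomic_int busy;	/* number of budget slots currently held */
	int depth;		/* assumed per-queue depth limit */
};

static void model_queue_init(struct model_queue *q,
			     const struct model_ops *ops, int depth)
{
	q->ops = ops;
	q->depth = depth;
	atomic_init(&q->busy, 0);
}

/*
 * Reserve one slot atomically: increment first, then back out if the
 * limit was already reached. A successful return is a real reservation.
 */
static bool model_get_budget(struct model_queue *q)
{
	if (atomic_fetch_add(&q->busy, 1) < q->depth)
		return true;

	atomic_fetch_sub(&q->busy, 1);
	return false;
}

/* Release one previously reserved slot. */
static void model_put_budget(struct model_queue *q)
{
	int old = atomic_fetch_sub(&q->busy, 1);

	assert(old > 0);	/* releasing without holding budget is a bug */
	(void)old;
}

/* Same shape as blk_mq_get_dispatch_budget(): no callback means success. */
static inline bool get_dispatch_budget(struct model_queue *q)
{
	if (!q->ops->get_budget)
		return true;

	return q->ops->get_budget(q);
}

/* Same shape as blk_mq_put_dispatch_budget(): only release what was taken. */
static inline void put_dispatch_budget(struct model_queue *q, bool got_budget)
{
	if (got_budget && q->ops->put_budget)
		q->ops->put_budget(q);
}

/* A driver model with a depth limit, and one without budget callbacks. */
static const struct model_ops budget_ops = {
	.get_budget = model_get_budget,
	.put_budget = model_put_budget,
};

static const struct model_ops no_budget_ops = { 0 };
```

With a depth of 3, the fourth get_dispatch_budget() call fails before
anything would have been dequeued, which is the property that keeps the
request in the scheduler where it can still be merged.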