From patchwork Fri Oct 13 16:28:00 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10005263 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 0EEFA602B3 for ; Fri, 13 Oct 2017 16:28:07 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E7FD9290E8 for ; Fri, 13 Oct 2017 16:28:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id DCE26290ED; Fri, 13 Oct 2017 16:28:06 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.4 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CB8C7290E8 for ; Fri, 13 Oct 2017 16:28:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751455AbdJMQ2E (ORCPT ); Fri, 13 Oct 2017 12:28:04 -0400 Received: from mail-it0-f53.google.com ([209.85.214.53]:52971 "EHLO mail-it0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751057AbdJMQ2D (ORCPT ); Fri, 13 Oct 2017 12:28:03 -0400 Received: by mail-it0-f53.google.com with SMTP id j140so11222625itj.1 for ; Fri, 13 Oct 2017 09:28:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=vyr47Lo0ErT+Ww1LEbGJ8p3raCCihS+b07iFJVbbmdM=; b=VJmYD0JVnqfKMmf3bWYlCl/+xOVrC6xEHzjLVc+SGCPiF5x3m7DMplyncZZZB5eXRH Yt6qwC2Z48WtxI5bZ9V6gMM7FLpFgrnBtpGDfSLUJ1P8rB3DTne0n1hk5qxN1aZQAKr8 txjZOEsGKbu4+tXYquk7QPrK8Shv920HfshqS9Hg7GASWyZ8pU55zBX8I3VcH8b4xTPM 7MSmeVniuXycgE3r7HUFKdpEVIF6Q57fbzThLCDnU7aCy7pVhLWFt0GEwbkEPuJdZhSz aU1cMRsBtOQP+RHsvto+fy/KRIs8121qQ5qUx3G2xuuRA+WY7xeIUhJ6ACwgQ+JpYPoC eJAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=vyr47Lo0ErT+Ww1LEbGJ8p3raCCihS+b07iFJVbbmdM=; b=aWx/rMPNBxAKTjF+4KpQM32Ibyk++BqRWF8hjHztt9aZ813FvfLx4rv9rhVujMoIvO Vo0KzMmXx70M/VBbxSOiBvzzsMKnJpjf/ONH9mYtssZ0ZoGXF3jVOQPLZ/KyS0iDbaN2 xvPoGa8G+7ktuqXEkkTGLhtOAAETjN57BcSZ6RtDbZ8w/vr/yKnSYLpydLDX/lKzfRMX j4RThKq7FLS4hVPpAh0LZSnCwQhXu42xrjIKYbVKg8gtGqNqlt8DZuraN44J3zSZqo1f hqAeZ2OG4sJgR36E4Y0VoFx1xu30f4IKkV1/AmZGlSOICjqGeZerK75GdCFfGFyI/SxR aqyQ== X-Gm-Message-State: AMCzsaW8I1fOJnwmY9uFLvNOsPWMJRn1GrqZXY+tQT9DFLO/vqKC+NOw GRZNLk670ECZl0kcRTcVwWWvow== X-Google-Smtp-Source: AOwi7QDslDpS+uVCzM/6rXl5Uo7IsmiS0tAIxzEUlaSfUaEL8PcavllN+dSbq/usdKocB2/sSgpByA== X-Received: by 10.36.69.91 with SMTP id y88mr2738632ita.99.1507912082545; Fri, 13 Oct 2017 09:28:02 -0700 (PDT) Received: from [192.168.1.154] ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id r124sm875452ita.13.2017.10.13.09.28.01 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 13 Oct 2017 09:28:01 -0700 (PDT) Subject: Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops To: Ming Lei Cc: linux-block@vger.kernel.org, Christoph Hellwig , Bart Van Assche , Laurence Oberman , Paolo Valente , Oleksandr Natalenko , Tom Nguyen , linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, Omar Sandoval , John Garry References: <20171012183704.22326-1-ming.lei@redhat.com> <20171012183704.22326-5-ming.lei@redhat.com> <9a741c03-90a3-e583-ddde-0ed71c8570a2@kernel.dk> <20171013001919.GA24715@ming.t460p> <6efdb459-8746-562d-06dc-5b3e172076e1@kernel.dk> <20171013160731.GA30899@ming.t460p> <845bd050-8566-8749-d73f-9a3731c7736f@kernel.dk> <20171013162111.GC30899@ming.t460p> From: Jens Axboe Message-ID: <18389d59-4228-e4f5-c6e2-ebeabe17fc37@kernel.dk> Date: Fri, 13 Oct 2017 10:28:00 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 In-Reply-To: <20171013162111.GC30899@ming.t460p> Content-Language: en-US Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 10/13/2017 10:21 AM, Ming Lei wrote: > On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote: >> On 10/13/2017 10:07 AM, Ming Lei wrote: >>> On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote: >>>> On 10/12/2017 06:19 PM, Ming Lei wrote: >>>>> On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote: >>>>>> On 10/12/2017 12:37 PM, Ming Lei wrote: >>>>>>> For SCSI devices, there is often per-request-queue depth, which need >>>>>>> to be respected before queuing one request. >>>>>>> >>>>>>> The current blk-mq always dequeues one request first, then calls .queue_rq() >>>>>>> to dispatch the request to lld. One obvious issue of this way is that I/O >>>>>>> merge may not be good, because when the per-request-queue depth can't be >>>>>>> respected, .queue_rq() has to return BLK_STS_RESOURCE, then this request >>>>>>> has to staty in hctx->dispatch list, and never got chance to participate >>>>>>> into I/O merge. >>>>>>> >>>>>>> This patch introduces .get_budget and .put_budget callback in blk_mq_ops, >>>>>>> then we can try to get reserved budget first before dequeuing request. >>>>>>> Once we can't get budget for queueing I/O, we don't need to dequeue request >>>>>>> at all, then I/O merge can get improved a lot. >>>>>> >>>>>> I can't help but think that it would be cleaner to just be able to >>>>>> reinsert the request into the scheduler properly, if we fail to >>>>>> dispatch it. Bart hinted at that earlier as well. >>>>> >>>>> Actually when I start to investigate the issue, the 1st thing I tried >>>>> is to reinsert, but that way is even worse on qla2xxx. >>>>> >>>>> Once request is dequeued, the IO merge chance is decreased a lot. >>>>> With none scheduler, it becomes not possible to merge because >>>>> we only try to merge over the last 8 requests. With mq-deadline, >>>>> when one request is reinserted, another request may be dequeued >>>>> at the same time. >>>> >>>> I don't care too much about 'none'. If perfect merging is crucial for >>>> getting to the performance level you want on the hardware you are using, >>>> you should not be using 'none'. 'none' will work perfectly fine for NVMe >>>> etc style devices, where we are not dependent on merging to the same >>>> extent that we are on other devices. >>>> >>>> mq-deadline reinsertion will be expensive, that's in the nature of that >>>> beast. It's basically the same as a normal request inserition. So for >>>> that, we'd have to be a bit careful not to run into this too much. Even >>>> with a dumb approach, it should only happen 1 out of N times, where N is >>>> the typical point at which the device will return STS_RESOURCE. The >>>> reinsertion vs dequeue should be serialized with your patch to do that, >>>> at least for the single queue mq-deadline setup. In fact, I think your >>>> approach suffers from that same basic race, in that the budget isn't a >>>> hard allocation, it's just a hint. It can change from the time you check >>>> it, and when you go and dispatch the IO, if you don't serialize that >>>> part. So really should be no different in that regard. >>> >>> In case of SCSI, the .get_buget is done as atomic counting, >>> and it is completely effective to avoid unnecessary dequeue, please take >>> a look at patch 6. >> >> Looks like you are right, I had initially misread that as just checking >> the busy count. But you are actually getting the count at that point, >> so it should be solid. >> >>>>> Not mention the cost of acquiring/releasing lock, that work >>>>> is just doing useless work and wasting CPU. >>>> >>>> Sure, my point is that if it doesn't happen too often, it doesn't really >>>> matter. It's not THAT expensive. >>> >>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3, >>> it is quite easy to trigger STS_RESOURCE. >> >> Ugh, that is low. >> >> OK, I think we should just roll with this and see how far we can go. I'll >> apply it for 4.15. > > OK, I have some update, will post a new version soon. Fold something like this into it then: diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index ccbbc7e108ea..b7bf84b5ddf2 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -102,13 +102,12 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) !e->type->ops.mq.has_work(hctx)) break; - if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) + if (!blk_mq_get_dispatch_budget(hctx)) return true; rq = e->type->ops.mq.dispatch_request(hctx); if (!rq) { - if (q->mq_ops->put_budget) - q->mq_ops->put_budget(hctx); + blk_mq_put_dispatch_budget(hctx, true); break; } list_add(&rq->queuelist, &rq_list); @@ -140,13 +139,12 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) if (!sbitmap_any_bit_set(&hctx->ctx_map)) break; - if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) + if (!blk_mq_get_dispatch_budget(hctx)) return true; rq = blk_mq_dequeue_from_ctx(hctx, ctx); if (!rq) { - if (q->mq_ops->put_budget) - q->mq_ops->put_budget(hctx); + blk_mq_put_dispatch_budget(hctx, true); break; } list_add(&rq->queuelist, &rq_list); diff --git a/block/blk-mq.c b/block/blk-mq.c index 255a705f8672..008c975b6f4b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1087,14 +1087,6 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx) return true; } -static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget) -{ - struct request_queue *q = hctx->queue; - - if (q->mq_ops->put_budget && got_budget) - q->mq_ops->put_budget(hctx); -} - bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { @@ -1125,7 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * rerun the hardware queue when a tag is freed. */ if (!blk_mq_dispatch_wait_add(hctx)) { - blk_mq_put_budget(hctx, got_budget); + blk_mq_put_dispatch_budget(hctx, got_budget); break; } @@ -1135,16 +1127,13 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * hardware queue to the wait queue. */ if (!blk_mq_get_driver_tag(rq, &hctx, false)) { - blk_mq_put_budget(hctx, got_budget); + blk_mq_put_dispatch_budget(hctx, got_budget); break; } } - if (!got_budget) { - if (q->mq_ops->get_budget && - !q->mq_ops->get_budget(hctx)) - break; - } + if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) + break; list_del_init(&rq->queuelist); @@ -1642,7 +1631,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (!blk_mq_get_driver_tag(rq, NULL, false)) goto insert; - if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) { + if (!blk_mq_get_dispatch_budget(hctx)) { blk_mq_put_driver_tag(rq); goto insert; } diff --git a/block/blk-mq.h b/block/blk-mq.h index 4d12ef08b0a9..9a1426e8b6e5 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -139,4 +139,23 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx) void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]); +static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + + if (!q->mq_ops->get_budget) + return true; + + return q->mq_ops->get_budget(hctx); +} + +static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx, + bool got_budget) +{ + struct request_queue *q = hctx->queue; + + if (got_budget && q->mq_ops->put_budget) + q->mq_ops->put_budget(hctx); +} + #endif