From patchwork Thu Apr  2 15:51:29 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Doug Anderson <dianders@chromium.org>
X-Patchwork-Id: 11471077
Return-Path: <SRS0=2u2q=5S=vger.kernel.org=linux-block-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 61C9A14B4
	for <patchwork-linux-block@patchwork.kernel.org>;
 Thu,  2 Apr 2020 15:52:08 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 40E0920757
	for <patchwork-linux-block@patchwork.kernel.org>;
 Thu,  2 Apr 2020 15:52:08 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=chromium.org header.i=@chromium.org
 header.b="ErbvY8r5"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S2389444AbgDBPwG (ORCPT
        <rfc822;patchwork-linux-block@patchwork.kernel.org>);
        Thu, 2 Apr 2020 11:52:06 -0400
Received: from mail-pg1-f196.google.com ([209.85.215.196]:35648 "EHLO
        mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S2389469AbgDBPwG (ORCPT
        <rfc822;linux-block@vger.kernel.org>); Thu, 2 Apr 2020 11:52:06 -0400
Received: by mail-pg1-f196.google.com with SMTP id k5so2039044pga.2
        for <linux-block@vger.kernel.org>;
 Thu, 02 Apr 2020 08:52:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=chromium.org; s=google;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=91TGSK7f3sZaNHrNjwv3lknFeXeezpxf66N2To7Vp6E=;
        b=ErbvY8r5SSroUWCgQcdORhlfrOOLrwx14vPvHlg/HqOvXU3jY+UWBJbMt0u3NpWhkg
         ZrrfQ+rYXIpGE2A8SN1byjVo3YEHplZeah09GFCUfFEOGiSR8cp+Xwzkq7CgbLE7VQet
         cYOnWvck9Qo2PnxG8bHN18C87gB0ojosODpbY=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=91TGSK7f3sZaNHrNjwv3lknFeXeezpxf66N2To7Vp6E=;
        b=lzQ5ts9xxoOjvpS1MWkt5F6N8sbetNN6qsn5P2R/tO42QCv2pMpNuyzBIjIbsSvP7F
         lG434HD+zWCVato7I5TjfFjwNB0zDYRsq56YV5JtpHrhWXUxyscSqwCioVHvgQTBJY2R
         hLVuLdwGP2k2sQlJb5wxKq0VDO4ypWu5r/UmJgCEwA9kt433ybhk9YXDg/16tSlIzXkS
         vvI7utOeHh9s0gjF+jCtAuRFtUCbGOUTPb1jtaGq2kjBiI9R43TphfCWbYN1b/BmW+Yw
         vM0WEtZTw+tI8ezIKiOgbbwRXzJStYRHgITRmvF8v4zkYA+DQvuSOeULHCNktvUVXnqx
         Tc/Q==
X-Gm-Message-State: AGi0PuY6LKOKvrh4gC9Lnhz6oWYK4lQ9ah3ypGA/JGzvEh6aAMj3I6FY
        bWelkBtDUdXYZg/gfmRKrLpaCA==
X-Google-Smtp-Source: 
 APiQypImzOCBUNtsbiPzjiDW9PuoxYILaMINgF1fKvv0r+xRLgvzitqDl42wjh7kUfwQMvp2cwkd2w==
X-Received: by 2002:a63:134e:: with SMTP id 14mr3943484pgt.380.1585842725090;
        Thu, 02 Apr 2020 08:52:05 -0700 (PDT)
Received: from tictac2.mtv.corp.google.com
 ([2620:15c:202:1:24fa:e766:52c9:e3b2])
        by smtp.gmail.com with ESMTPSA id
 x68sm2578815pfb.5.2020.04.02.08.52.03
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 02 Apr 2020 08:52:04 -0700 (PDT)
From: Douglas Anderson <dianders@chromium.org>
To: axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com
Cc: paolo.valente@linaro.org, sqazi@google.com,
        linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
        Ming Lei <ming.lei@redhat.com>, groeck@chromium.org,
        Douglas Anderson <dianders@chromium.org>,
        linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a
 reason to kick
Date: Thu,  2 Apr 2020 08:51:29 -0700
Message-Id: 
 <20200402085050.v2.1.I1f95c459e51962b8d2c83e869913b6befda2255c@changeid>
X-Mailer: git-send-email 2.26.0.rc2.310.g2932bb562d-goog
In-Reply-To: <20200402155130.8264-1-dianders@chromium.org>
References: <20200402155130.8264-1-dianders@chromium.org>
MIME-Version: 1.0
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org

In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue.  However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.
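
The resulting decision can be modelled outside the kernel.  This is a
userspace sketch only: the enum, the struct and decide_kick() are
hypothetical names, and each field stands in for the kernel condition
named in its comment.

```c
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. */
enum kick_action { KICK_NONE, KICK_RUN_NOW, KICK_RUN_DELAYED };

struct dispatch_state {
	bool needs_restart;	/* blk_mq_sched_needs_restart(hctx) */
	bool no_tag;		/* we failed to get a driver tag */
	bool wait_entry_empty;	/* hctx->dispatch_wait.entry is empty */
	bool driver_resource;	/* ret == BLK_STS_RESOURCE */
	bool no_budget_avail;	/* new: we failed to get budget */
};

/* Mirrors the if / else-if chain at the end of the dispatch loop. */
static enum kick_action decide_kick(const struct dispatch_state *s)
{
	if (!s->needs_restart || (s->no_tag && s->wait_entry_empty))
		return KICK_RUN_NOW;
	if (s->driver_resource || s->no_budget_avail)
		return KICK_RUN_DELAYED;
	return KICK_NONE;
}
```

In this model, "no budget" with SCHED_RESTART set now yields a delayed
run, where before the patch it would have yielded no kick at all.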

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---

Changes in v2: None

 block/blk-mq.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..2cd8d2b49ff4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1189,6 +1189,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool no_budget_avail = false;
 
 	if (list_empty(list))
 		return false;
@@ -1205,8 +1206,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		rq = list_first_entry(list, struct request, queuelist);
 
 		hctx = rq->mq_hctx;
-		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
+			no_budget_avail = true;
 			break;
+		}
 
 		if (!blk_mq_get_driver_tag(rq)) {
 			/*
@@ -1311,13 +1314,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 *
 		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
 		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 * that could otherwise occur if the queue is idle.  We'll do
+		 * the same if we couldn't get budget and SCHED_RESTART is set.
 		 */
 		needs_restart = blk_mq_sched_needs_restart(hctx);
 		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+		else if (needs_restart && (ret == BLK_STS_RESOURCE ||
+					   no_budget_avail))
 			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
 
 		blk_mq_update_dispatch_busy(hctx, true);

From patchwork Thu Apr  2 15:51:30 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Doug Anderson <dianders@chromium.org>
X-Patchwork-Id: 11471079
Return-Path: <SRS0=2u2q=5S=vger.kernel.org=linux-block-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E30591668
	for <patchwork-linux-block@patchwork.kernel.org>;
 Thu,  2 Apr 2020 15:52:16 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id B824520787
	for <patchwork-linux-block@patchwork.kernel.org>;
 Thu,  2 Apr 2020 15:52:16 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=chromium.org header.i=@chromium.org
 header.b="oXd2fQ64"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S2389469AbgDBPwP (ORCPT
        <rfc822;patchwork-linux-block@patchwork.kernel.org>);
        Thu, 2 Apr 2020 11:52:15 -0400
Received: from mail-pf1-f180.google.com ([209.85.210.180]:37548 "EHLO
        mail-pf1-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S2389529AbgDBPwI (ORCPT
        <rfc822;linux-block@vger.kernel.org>); Thu, 2 Apr 2020 11:52:08 -0400
Received: by mail-pf1-f180.google.com with SMTP id u65so1946216pfb.4
        for <linux-block@vger.kernel.org>;
 Thu, 02 Apr 2020 08:52:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=chromium.org; s=google;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=7Q3jBwdmyUkdTL07RQ4evpZO2DzG9gMebwuAt0SZP0k=;
        b=oXd2fQ64qrNWa66OlP7Ra3x06XYTxk19Ian0YyjvHZj5/xAdfi6Rn+rhULcFiJJkFA
         xrhIx+PkuvI+401s9kuz3pEQfYye0k4XFNSTEoyu5ksbpOWYrAuxoYBh5NHO6bQjYIPH
         MVgOr6xCzmp/PtPuWeAYDTYLg0N6T5ue/nISU=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=7Q3jBwdmyUkdTL07RQ4evpZO2DzG9gMebwuAt0SZP0k=;
        b=josq+UL3os/hV4g8rG62O2VzjebM/QAiuZhcyw+M7W1xlFfxNLk9zGzyKxyYGFxEWJ
         2fqyVf+FWSuxbIoWH+GF2wrZgTWNy+e46jsLLCCPl3muw3pbPXaJY6rT6P+vkBcBrBcg
         euuAgefAhq2GjvJBnkrseH7L6deeAfHUOevUZTGCC1CfvX3xsXJtB85ZeukMG7RcP3Hh
         3EglpSzvdZrWEZGcQooexCUW4JJJ5SnLgcJcsW1atPukKqcoNy1PVKYVjuXtxJmuEt04
         tvppDaclWtt/FxWHVbRtAk1WHf16q6pE5cX8k+ur4fNnBcc2aryYWj2FC+QvWywjIrTt
         GF8w==
X-Gm-Message-State: AGi0PuaVTnAjb+X+RjNeMAOf5U4vTsJKgNPNsA38tGS+NZjCxBwQw7rR
        HJ8cxSo//OUoIr74J56UzBUa3g==
X-Google-Smtp-Source: 
 APiQypLIOJvFZ3n8tsGjP8rKreVP3/DZYTDGeAQSmvs04KJAkllBKQ3g3zeXbEbSSbGmDvAMgnIiOw==
X-Received: by 2002:aa7:9dc1:: with SMTP id g1mr3832763pfq.308.1585842726489;
        Thu, 02 Apr 2020 08:52:06 -0700 (PDT)
Received: from tictac2.mtv.corp.google.com
 ([2620:15c:202:1:24fa:e766:52c9:e3b2])
        by smtp.gmail.com with ESMTPSA id
 x68sm2578815pfb.5.2020.04.02.08.52.05
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 02 Apr 2020 08:52:05 -0700 (PDT)
From: Douglas Anderson <dianders@chromium.org>
To: axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com
Cc: paolo.valente@linaro.org, sqazi@google.com,
        linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
        Ming Lei <ming.lei@redhat.com>, groeck@chromium.org,
        Douglas Anderson <dianders@chromium.org>,
        Ajay Joshi <ajay.joshi@wdc.com>, Arnd Bergmann <arnd@arndb.de>,
        Bart Van Assche <bvanassche@acm.org>,
        Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>,
        Damien Le Moal <damien.lemoal@wdc.com>,
        Hou Tao <houtao1@huawei.com>,
        Pavel Begunkov <asml.silence@gmail.com>,
        Tejun Heo <tj@kernel.org>, linux-kernel@vger.kernel.org
Subject: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget
 contention
Date: Thu,  2 Apr 2020 08:51:30 -0700
Message-Id: 
 <20200402085050.v2.2.I28278ef8ea27afc0ec7e597752a6d4e58c16176f@changeid>
X-Mailer: git-send-email 2.26.0.rc2.310.g2932bb562d-goog
In-Reply-To: <20200402155130.8264-1-dianders@chromium.org>
References: <20200402155130.8264-1-dianders@chromium.org>
MIME-Version: 1.0
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org

It is possible for two threads to be running
blk_mq_do_dispatch_sched() at the same time with the same "hctx".
This is because there can be more than one caller to
__blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
prevent more than one thread from entering.

If more than one thread is running blk_mq_do_dispatch_sched() at the
same time with the same "hctx", they may have contention acquiring
budget.  The blk_mq_get_dispatch_budget() call can eventually translate
into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
uncommon) then only one of the two threads will increment
"device_busy" to 1 and get the budget.

The losing thread will break out of blk_mq_do_dispatch_sched() and
will stop dispatching requests.  The assumption is that when more
budget is available later (when existing transactions finish) the
queue will be kicked again, perhaps in scsi_end_request().

The winning thread now has budget and can go on to call
dispatch_request().  If dispatch_request() returns NULL here then we
have a potential problem.  Specifically we'll now call
blk_mq_put_dispatch_budget() which translates into
scsi_mq_put_budget().  That will mark the device as no longer busy but
doesn't do anything to kick the queue.  This violates the assumption
that the queue would be kicked when more budget was available.

Pictorially:

Thread A                          Thread B
================================= ==================================
blk_mq_get_dispatch_budget() => 1
dispatch_request() => NULL
                                  blk_mq_get_dispatch_budget() => 0
                                  // because Thread A marked
                                  // "device_busy" in scsi_device
blk_mq_put_dispatch_budget()
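
The same interleaving can be replayed deterministically in userspace.
This is a sketch under stated assumptions: struct lk_dev and the lk_*()
helpers are hypothetical stand-ins for the budget accounting described
above, with "queue_depth" fixed at 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device with a budget limit. */
struct lk_dev {
	int queue_depth;
	int device_busy;
	int kicks;	/* how often the queue was rerun; never bumped here */
};

static bool lk_get_budget(struct lk_dev *d)
{
	if (d->device_busy >= d->queue_depth)
		return false;
	d->device_busy++;
	return true;
}

static void lk_put_budget(struct lk_dev *d)
{
	assert(d->device_busy > 0);
	d->device_busy--;
	/* As described above, nothing kicks the queue here. */
}

/* Replays the table above; returns true if we end up stalled. */
static bool lk_replay(void)
{
	struct lk_dev d = { .queue_depth = 1 };
	bool a_got, b_got;

	a_got = lk_get_budget(&d);	/* Thread A => 1 */
	/* Thread A: dispatch_request() => NULL */
	b_got = lk_get_budget(&d);	/* Thread B => 0, B stops dispatching */
	if (a_got)
		lk_put_budget(&d);	/* Thread A gives the budget back */

	/* Budget is free again, but nobody was told to rerun. */
	return a_got && !b_got && d.device_busy == 0 && d.kicks == 0;
}
```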

The above case was observed in reboot tests and caused a task to hang
forever waiting for IO to complete.  Traces showed that in fact two
tasks were running blk_mq_do_dispatch_sched() at the same time with
the same "hctx".  The task that got the budget did in fact see
dispatch_request() return NULL.  Both tasks returned and the system
went on for several minutes (until the hung task delay kicked in)
without the given "hctx" showing up again in traces.

Let's attempt to fix this problem by detecting if there was contention
for the budget in the case where we put the budget without dispatching
anything.  If we saw contention we kick all hctx's associated with the
queue where there was contention.  We do this without any locking by
adding a double-check for budget and accepting a small amount of faux
contention if the 2nd check gives us budget but then we don't dispatch
anything (we'll look like we contended with ourselves).
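
A userspace model of this scheme, using C11 atomics, might look like
the following.  It is a sketch, not the kernel implementation: struct
cq_queue and the cq_*() helpers are hypothetical names, and the budget
counter is a simplified stand-in for "device_busy".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a request queue with a contention flag. */
struct cq_queue {
	int queue_depth;
	atomic_int device_busy;
	atomic_int budget_contention;
	int reruns;	/* counts calls to cq_run_hw_queues() */
};

static void cq_init(struct cq_queue *q, int depth)
{
	q->queue_depth = depth;
	q->reruns = 0;
	atomic_init(&q->device_busy, 0);
	atomic_init(&q->budget_contention, 0);
}

static bool cq_get_budget(struct cq_queue *q)
{
	/* Optimistically take a slot, undo if we went over the limit. */
	if (atomic_fetch_add(&q->device_busy, 1) >= q->queue_depth) {
		atomic_fetch_sub(&q->device_busy, 1);
		return false;
	}
	return true;
}

static void cq_run_hw_queues(struct cq_queue *q)
{
	/* Running the queues clears the contention detector. */
	atomic_store(&q->budget_contention, 0);
	q->reruns++;
}

/* Losing side: flag contention, then check one extra time. */
static bool cq_get_budget_or_flag(struct cq_queue *q)
{
	if (cq_get_budget(q))
		return true;
	atomic_store(&q->budget_contention, 1);
	return cq_get_budget(q);
}

/* Winning side, for the case where nothing was dispatched. */
static void cq_put_budget_and_maybe_rerun(struct cq_queue *q)
{
	assert(atomic_load(&q->device_busy) > 0);
	atomic_fetch_sub(&q->device_busy, 1);
	if (atomic_load(&q->budget_contention))
		cq_run_hw_queues(q);
}

/* Replays the A/B interleaving; returns the number of reruns. */
static int cq_replay(struct cq_queue *q)
{
	bool a_got = cq_get_budget_or_flag(q);	/* Thread A gets budget */

	(void)cq_get_budget_or_flag(q);		/* Thread B loses, sets flag */
	if (a_got)
		cq_put_budget_and_maybe_rerun(q); /* A sees the flag */
	return q->reruns;
}
```

In this model the replay ends with one rerun and a cleared flag, so the
losing thread's work is not left stranded.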

A few extra notes:

- This whole thing is only a problem due to the inexact nature of
  has_work().  Specifically if has_work() always guaranteed that a
  "true" return meant that dispatch_request() would return non-NULL
  then we wouldn't have this problem.  That's because we only grab the
  budget if has_work() returned true.  If we had the non-NULL
  guarantee then at least one of the threads would actually dispatch
  work and when the work was done then queues would be kicked
  normally.

- One specific I/O scheduler that trips this problem quite a bit is
  BFQ, which definitely returns "true" for has_work() in cases where it
  wouldn't dispatch.  Making BFQ's has_work() more exact requires that
  has_work() becomes a much heavier function, including figuring out
  how to acquire spinlocks in has_work() without tripping circular
  lock dependencies.  This is prototyped but it's unclear if it's
  really the way to go when the problem can be solved with a
  relatively lightweight contention detection mechanism.

- Because this problem only trips with inexact has_work() it's
  believed that blk_mq_do_dispatch_ctx() does not need a similar
  change.
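
To make "inexact has_work()" concrete, here is a toy scheduler in
userspace C.  Everything in it is hypothetical (struct toy_sched, the
"holding" flag, integer request ids) and it does not describe how BFQ
or any real scheduler decides; it only shows has_work() being true
while dispatch returns nothing.

```c
#include <stdbool.h>

/* Hypothetical toy scheduler, not modelled on any real one. */
struct toy_sched {
	int queued;	/* requests held by the scheduler */
	bool holding;	/* scheduler is deliberately holding them back */
};

/* Cheap check: looks only at the counter, ignores "holding". */
static bool toy_has_work(const struct toy_sched *s)
{
	return s->queued > 0;
}

/* Returns a request id > 0, or 0 standing in for a NULL request. */
static int toy_dispatch_request(struct toy_sched *s)
{
	int id;

	if (s->queued == 0 || s->holding)
		return 0;
	id = s->queued;
	s->queued--;
	return id;
}
```

With "queued" at 2 and "holding" set, toy_has_work() is true yet
toy_dispatch_request() returns 0, which is the window where a thread
takes budget and then gives it back without dispatching.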

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         |  3 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..0195d75f5f96 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
-		if (!blk_mq_get_dispatch_budget(hctx))
-			break;
+
+		if (!blk_mq_get_dispatch_budget(hctx)) {
+			/*
+			 * We didn't get budget so set contention.  If
+			 * someone else had the budget but didn't dispatch
+			 * they'll kick everything.  NOTE: we check one
+			 * extra time _after_ setting contention to fully
+			 * close the race.  If we don't actually dispatch
+			 * we'll detect faux contention (with ourselves)
+			 * but that should be rare.
+			 */
+			atomic_set(&q->budget_contention, 1);
+
+			if (!blk_mq_get_dispatch_budget(hctx))
+				break;
+		}
 
 		rq = e->type->ops.dispatch_request(hctx);
 		if (!rq) {
 			blk_mq_put_dispatch_budget(hctx);
+
+			/*
+			 * We've released the budget but us holding it might
+			 * have prevented someone else from dispatching.
+			 * Detect that case and run all queues again.
+			 */
+			if (atomic_read(&q->budget_contention))
+				blk_mq_run_hw_queues(q, true);
 			break;
 		}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cd8d2b49ff4..6163c43ceca5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/* We're running the queues, so clear the contention detector */
+	atomic_set(&q->budget_contention, 0);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..07f21e45d993 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -583,6 +583,8 @@ struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
+
+	atomic_t		budget_contention;
 };
 
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */