From patchwork Tue Apr 10 21:33:23 2018
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 10333883
Date: Tue, 10 Apr 2018 14:33:23 -0700
From: "tj@kernel.org"
To: Ming Lei
Cc: Bart Van Assche, "linux-block@vger.kernel.org", "israelr@mellanox.com",
 "sagi@grimberg.me", "hch@lst.de", "stable@vger.kernel.org",
 "axboe@kernel.dk", "maxg@mellanox.com"
Subject: Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Message-ID: <20180410213323.GC793541@devbig577.frc2.facebook.com>
References: <20180410084133.GB9133@ming.t460p>
 <20180410135541.GA22340@ming.t460p>
 <7c4a8b019182d5a9259ef20e6462b3f6a533abed.camel@wdc.com>
 <20180410143007.GB22340@ming.t460p>
 <20180410152553.GC22340@ming.t460p>
 <20180410153031.GO3126663@devbig577.frc2.facebook.com>
 <20180410153812.GA3219@ming.t460p>
 <20180410154043.GP3126663@devbig577.frc2.facebook.com>
In-Reply-To: <20180410154043.GP3126663@devbig577.frc2.facebook.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Mailing-List: linux-block@vger.kernel.org

Hello,

On Tue, Apr 10, 2018 at 08:40:43AM -0700, tj@kernel.org wrote:
> On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> > I agree, the issue should be in the driver's irq handler and .timeout
> > in theory.
> >
> > For example, even though one request has been done by the irq handler,
> > .timeout may still return RESET_TIMER.
>
> blk-mq can use a separate flag to track normal completions during
> timeout handling and complete the request normally on RESET_TIMER if
> the flag is set while EH was in progress.  With a bit of care, we'd be
> able to plug the race completely.

So, something like the following.  Just compile tested.  The extra
dedicated field is kind of unfortunate, but we need something like it
no matter how we do this, because the completion marking involved
violates the usual ownership rules (we want to record the normal
completion while timeout handling is in progress).  Alternatively, we
could add an atomic flags field.  The addition doesn't change the size
of struct request because it fits in a hole, but that hole seems
removable, so...

Anyway, it'd be great if someone could test this combined with the
previous two RCU sync patches.

Thanks.
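To illustrate the idea (not the kernel code itself), here is a minimal
user-space sketch of the race the flag plugs: a normal completion that
arrives while timeout handling owns the request is recorded in
->missed_completion and replayed once the timeout path hands the request
back.  The toy_* names and the single owned_by_timeout bool standing in
for the aborted_gstate check are simplifications of mine, not part of
the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_request {
	bool owned_by_timeout;	/* stands in for aborted_gstate matching */
	bool missed_completion;	/* the new field from the patch */
	bool completed;
};

/*
 * Normal completion path: while the timeout path owns the request we
 * must not complete it; just record that a completion was missed.
 */
static void toy_complete(struct toy_request *rq)
{
	if (!rq->owned_by_timeout)
		rq->completed = true;
	else
		rq->missed_completion = true;
}

/*
 * Timeout path after RESET_TIMER: return ownership, then replay a
 * completion that raced us, roughly what blk_mq_timeout_reset_cleanup()
 * does in the patch below.
 */
static void toy_timeout_reset_cleanup(struct toy_request *rq)
{
	rq->owned_by_timeout = false;
	if (rq->missed_completion) {
		rq->missed_completion = false;
		toy_complete(rq);
	}
}
```

Without the flag, the completion that fires during EH would simply be
dropped, leaving the request to time out again spuriously.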
---
 block/blk-mq.c         | 44 +++++++++++++++++++++++++++++++-------------
 include/linux/blkdev.h |  2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -881,7 +883,7 @@ static void blk_mq_check_expired(struct
 	}
 }

-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +898,8 @@ static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);

-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +920,37 @@ static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }

-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }

+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 	 * becomes a problem, we can add per-hw_ctx rcu_head and
 	 * wait in parallel.
 	 */
-	blk_mq_timeout_sync_rcu(q);
+	blk_mq_timeout_sync_rcu(q, true);

 	/* terminate the ones we won */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1003,12 @@ static void blk_mq_timeout_work(struct w
 	 * reset racing against recycling.
 	 */
 	if (nr_resets) {
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, false);
+		blk_mq_queue_tag_busy_iter(q,
+				blk_mq_timeout_reset_return, NULL);
+		blk_mq_timeout_sync_rcu(q, true);
 		blk_mq_queue_tag_busy_iter(q,
-				blk_mq_finish_timeout_reset, NULL);
+				blk_mq_timeout_reset_cleanup, NULL);
 	}
 }

--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {

 	unsigned int extra_len;	/* length of alignment and padding */

+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the