From patchwork Tue Feb 13 21:20:44 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 10217615 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 526996055C for ; Tue, 13 Feb 2018 21:20:50 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 40DFF28CFF for ; Tue, 13 Feb 2018 21:20:50 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3461428EF8; Tue, 13 Feb 2018 21:20:50 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C812C28CFF for ; Tue, 13 Feb 2018 21:20:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965749AbeBMVUt (ORCPT ); Tue, 13 Feb 2018 16:20:49 -0500 Received: from mail-qt0-f194.google.com ([209.85.216.194]:33622 "EHLO mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965781AbeBMVUs (ORCPT ); Tue, 13 Feb 2018 16:20:48 -0500 Received: by mail-qt0-f194.google.com with SMTP id d8so5307670qtm.0 for ; Tue, 13 Feb 2018 13:20:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=0842u4xlbaonQwWbISqjNXMsEaoKsM+0lZ5UbhHkO18=; b=amuBxZMS2YJWWkiOyHdHTtNfZozkeefUrIRUllsSdOSlINynlkONdbjeg0UXUl7CDO Qyi8kvrRlshTZVKkNNDIw+CC8uqKajW+LXvu0zVmvxHkezTR7MU0QwK0ZEeeV4/iKFB9 pichHr8kghrN3S+dKICeD95ZUc5u5d97tsAa8kuC+GsshkIA6i3ZZPeVLhe1OfVA45kK uddIYH+JQLZm8da6fFl19670LtZ6Dk3vY61sngpqdTNcsnrPkIaSpwQFu2PCa+g2NUwz I2czUTOEydhIGZI+z5CzpiIqpU1hDnd+QQ2EesqWE2V0EC/nWsspR4jKxOAntGuCHpbE 9Syg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=0842u4xlbaonQwWbISqjNXMsEaoKsM+0lZ5UbhHkO18=; b=lOrlsrnEmJsSQsICakY7qdu3z12aIAPuPSqKawWApyP7cq0YqCUMmz2PIgJSQa1b/D TOsO6F2wDALdlxndTeHcMyA7B80pGXohLnfqtI7+Zle+ezsdoMzI/bHHCWUL2BAIA24O 42ue3CTZbgwLf8oT95iWQBCY4J9XenWSDQqvu50UWptPLqPqomDo236FNpHFMBXC3X1J 6ZmBns2Cumcmcludnjdd5Mndhq2mIKpwJuSW8R3DPOSC0YVHgJ33WskKZSgwBi+h4jqt 3utJXthx4Qvg0TFXTRj/369HP5iRYv3xhaNM/jIMDHL/3SYcfAhAEdpPwcx/EI2NfUDJ dWug== X-Gm-Message-State: APf1xPDsEBKnQ5vn4rBqFKCRtJtOK1pKFH/rzuAid5BOpg7piy9FHEw2 cd07W0iGIwBYpN8gw3uLrjQ= X-Google-Smtp-Source: AH8x226W1xyOMHxezWPZRYlkAFCzq9Lbm63st93Kznw39cg/JSwrDx8+eK6p53UdJz4EqR9AO5sz+Q== X-Received: by 10.200.27.172 with SMTP id z41mr4325790qtj.58.1518556847667; Tue, 13 Feb 2018 13:20:47 -0800 (PST) Received: from localhost (dhcp-ec-8-6b-ed-7a-cf.cpe.echoes.net. [72.28.5.223]) by smtp.gmail.com with ESMTPSA id y9sm6489166qti.20.2018.02.13.13.20.46 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 13 Feb 2018 13:20:47 -0800 (PST) Date: Tue, 13 Feb 2018 13:20:44 -0800 From: "tj@kernel.org" To: Bart Van Assche Cc: "hch@lst.de" , "linux-block@vger.kernel.org" , "axboe@kernel.dk" Subject: Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling Message-ID: <20180213212044.GS695913@devbig577.frc2.facebook.com> References: <20180207011133.25957-1-bart.vanassche@wdc.com> <20180207170612.GB695913@devbig577.frc2.facebook.com> <1518024428.2870.35.camel@wdc.com> <20180207173531.GC695913@devbig577.frc2.facebook.com> <1518027251.2870.53.camel@wdc.com> <20180207200724.GD695913@devbig577.frc2.facebook.com> <1518047297.2870.80.camel@wdc.com> <1518052193.2870.90.camel@wdc.com> <20180208153940.GM695913@devbig577.frc2.facebook.com> <1518107501.3611.19.camel@wdc.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1518107501.3611.19.camel@wdc.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Hello, Bart. Sorry about the delay. On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote: > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The > instruction at that address tries to dereference scsi_cmnd.device (%rax). The > register dump shows that that pointer has the value NULL. The only function I > know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only > caller of that function in the SCSI core is scsi_initialize_rq(). That function > has two callers, namely scsi_init_command() and blk_get_request(). However, > the scsi_cmnd.device pointer is not cleared when a request finishes. This is > why I think that the above crash report indicates that scsi_times_out() was > called for a request that was being reinitialized and not by device hotplugging. Can you please give the following patch a shot? While timeout path is synchornizing against the completion path (and the following re-init) while taking back control of a timed-out request, it wasn't doing that while giving it back, so the timer registration could race against completion and re-issue. I'm still not quite sure how that can lead to the oops tho. Anyways, we need something like this one way or the other. This isn't the final patch. We should add batching-up of rcu synchronize calls similar to the abort path. Thanks. diff --git a/block/blk-mq.c b/block/blk-mq.c index df93102..b66aec3 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -816,7 +816,8 @@ struct blk_mq_timeout_data { unsigned int nr_expired; }; -static void blk_mq_rq_timed_out(struct request *req, bool reserved) +static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req, + bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; @@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) * ->aborted_gstate is set, this may lead to ignored * completions and further spurious timeouts. */ - blk_mq_rq_update_aborted_gstate(req, 0); blk_add_timer(req); + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) + synchronize_rcu(); + else + synchronize_srcu(hctx->srcu); + blk_mq_rq_update_aborted_gstate(req, 0); break; case BLK_EH_NOT_HANDLED: break; @@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, */ if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && READ_ONCE(rq->gstate) == rq->aborted_gstate) - blk_mq_rq_timed_out(rq, reserved); + blk_mq_rq_timed_out(hctx, rq, reserved); } static void blk_mq_timeout_work(struct work_struct *work)