From patchwork Tue Feb 28 16:57:20 2017
X-Patchwork-Submitter: Keith Busch
X-Patchwork-Id: 9596407
Date: Tue, 28 Feb 2017 11:57:20 -0500
From: Keith Busch
To: Artur Paszkiewicz
Cc: Sagi Grimberg, linux-block@vger.kernel.org, Jens Axboe, Marc MERLIN,
 Christoph Hellwig, linux-nvme@lists.infradead.org
Subject: Re: [PATCHv2 2/2] nvme: Complete all stuck requests
Message-ID: <20170228165719.GA23236@localhost.localdomain>
References: <1487896561-10454-1-git-send-email-keith.busch@intel.com>
 <1487896561-10454-2-git-send-email-keith.busch@intel.com>
 <20170227150107.GA5789@localhost.localdomain>
 <4c952c67-9871-249d-86b0-5f81acf00571@grimberg.me>
 <20170227191459.GA1170@localhost.localdomain>
 <913d4d24-b80b-b7f6-2374-8095bc6fc047@intel.com>
In-Reply-To: <913d4d24-b80b-b7f6-2374-8095bc6fc047@intel.com>
List-ID: linux-block@vger.kernel.org

On Tue, Feb 28, 2017 at 08:42:19AM +0100, Artur Paszkiewicz wrote:
> I'm observing the same thing when hibernating during mdraid resync on
> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
> CPUs ...".

The patch guarantees forward progress for blk-mq's hot-cpu notifier on
nvme request queues by failing all entered requests. It sounds like some
part of your setup needs those requests to succeed in order to
hibernate. If your mdraid uses a stacking request_queue that submits
retries while its request_queue is still entered, that may explain why
you remain stuck in blk_mq_freeze_queue_wait().
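
To illustrate what "entered" means here, below is a minimal sketch of the
blk-mq freeze pattern under discussion. It is not code from either patch in
this mail; the function name is made up for illustration, and the three
calls are the existing helpers declared in include/linux/blk-mq.h.

	#include <linux/blk-mq.h>

	/* Sketch only: freeze a blk-mq queue and drain entered requests. */
	static void example_freeze_and_drain(struct request_queue *q)
	{
		/* Kill q_usage_counter so no new request can enter the queue. */
		blk_mq_freeze_queue_start(q);

		/*
		 * Sleep until q_usage_counter reaches zero, i.e. every request
		 * that already entered the queue has completed or failed. A
		 * stacked driver that keeps resubmitting retries while it holds
		 * the queue entered prevents this wait from ever finishing.
		 */
		blk_mq_freeze_queue_wait(q);

		/* ... safe to reconfigure or tear down the queue here ... */

		blk_mq_unfreeze_queue(q);
	}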
> This patch did not help but when I put nvme_wait_freeze() right after
> nvme_start_freeze() it appeared to be working. Maybe the difference
> here is that requests are submitted from a non-freezable kernel thread
> (md sync_thread)?

Waiting for the freeze prior to quiescing the queue is fine when the
controller is functioning, but it would be impossible to complete a
reset if the controller is in a failed or degraded state. We probably
want to give those requests a chance to succeed, so I think we need to
be able to time out the freeze wait.

Below are the two patches I tested. Prior to these, the fio test would
report IO errors from some of its jobs; with them applied, there are no
errors.

---
 block/blk-mq.c         | 9 +++++++++
 include/linux/blk-mq.h | 2 ++
 2 files changed, 11 insertions(+)
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8da2c04..a5e66a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -81,6 +81,15 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
+int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
+				     unsigned long timeout)
+{
+	return wait_event_timeout(q->mq_freeze_wq,
+				  percpu_ref_is_zero(&q->q_usage_counter),
+				  timeout);
+}
+EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
+
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8dacf68..b296a90 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -246,6 +246,8 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
+int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
+				     unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
--
---
 drivers/nvme/host/core.c | 14 ++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 18 ++++++++++++++----
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9e99b94..9b3b57f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2355,6 +2355,20 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
+void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
+		if (timeout <= 0)
+			break;
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
+
 void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 62af901..2aa20e3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -296,6 +296,7 @@ void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
+void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b266fb9..a7a423f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1691,22 +1691,32 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i, queues;
 	u32 csts = -1;
+	bool dead;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	del_timer_sync(&dev->watchdog_timer);
 
 	mutex_lock(&dev->shutdown_lock);
 	if (shutdown)
 		nvme_start_freeze(&dev->ctrl);
-	if (pci_is_enabled(to_pci_dev(dev->dev))) {
-		nvme_stop_queues(&dev->ctrl);
+	if (pci_is_enabled(pdev))
 		csts = readl(dev->bar + NVME_REG_CSTS);
-	}
+	dead = (csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
+		pdev->error_state != pci_channel_io_normal;
+
+	/*
+	 * Give the controller a chance to complete all entered requests if
+	 * doing a safe shutdown.
+	 */
+	if (!dead && shutdown)
+		nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+	nvme_stop_queues(&dev->ctrl);
 
 	queues = dev->online_queues - 1;
 	for (i = dev->queue_count - 1; i > 0; i--)
 		nvme_suspend_queue(dev->queues[i]);
 
-	if (csts & NVME_CSTS_CFS || !(csts & NVME_CSTS_RDY)) {
+	if (dead) {
 		/* A device might become IO incapable very soon during
 		 * probe, before the admin queue is configured. Thus,
 		 * queue_count can be 0 here.
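
A note on the new helper's semantics, with a usage sketch that is not part
of either patch above (the caller name and the five second budget are made
up for illustration): blk_mq_freeze_queue_wait_timeout() simply returns what
wait_event_timeout() returns, i.e. 0 if the timeout expired while requests
were still entered, otherwise the remaining jiffies (at least 1). That is
also why nvme_wait_freeze_timeout() feeds the remaining budget into the next
namespace's wait, so the total wait is bounded by the initial timeout rather
than by a timeout per queue.

	#include <linux/blk-mq.h>
	#include <linux/errno.h>
	#include <linux/jiffies.h>

	/* Sketch only: freeze one queue, but give up after five seconds. */
	static int example_timed_freeze(struct request_queue *q)
	{
		blk_mq_freeze_queue_start(q);
		if (!blk_mq_freeze_queue_wait_timeout(q, 5 * HZ)) {
			/* Requests are still entered; back out of the freeze. */
			blk_mq_unfreeze_queue(q);
			return -ETIMEDOUT;
		}
		return 0;
	}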