From patchwork Tue Jun 27 06:37:51 2017
X-Patchwork-Submitter: Sagi Grimberg
X-Patchwork-Id: 9810875
Subject: Re: Deadlock on device removal event for NVMeF target
To: Shiraz Saleem , hch@lst.de
Cc: linux-rdma@vger.kernel.org, linux-nvme
From: Sagi Grimberg
Message-ID: <56030fcd-b8a0-fc0e-18e5-985ebf16a82e@grimberg.me>
Date: Tue, 27 Jun 2017 09:37:51 +0300
In-Reply-To: <20170626225920.GA11700@ssaleem-MOBL4.amr.corp.intel.com>
X-Mailing-List: linux-rdma@vger.kernel.org

> Hi Sagi/Christoph,

Hi Shiraz,

Please CC linux-nvme for nvme-rdma related stuff.

> I am seeing a deadlock on a device removal event for the NVMeF target.
>
> The sequence of events leading to the deadlock is as follows:
>
> 1. i40iw posts IW_CM_EVENT_CLOSE events for all QPs, causing the
>    corresponding NVMet RDMA queues to disconnect and schedule release
>    of any pending work on the WQ
> 2. i40iw triggers device removal
>
>    ib_unregister_device
>    [..]
>    cma_remove_id_dev (takes a handler lock before calling the event handler)
>    nvmet_rdma_cm_handler
>    nvmet_rdma_device_removal (queue->state == NVMET_RDMA_Q_DISCONNECTING due to 1.)
>    flush_scheduled_work (blocks until all scheduled work is drained from the WQ)
>    nvmet_rdma_release_queue_work (queue->state == NVMET_RDMA_Q_DISCONNECTING)
>    rdma_destroy_id (waits on the same handler lock as cma_remove_id_dev, causing the deadlock)
>
> So this problem can occur when there is a device removal event while the
> queue is in the disconnecting state, with some outstanding work that hasn't
> been drained from the WQ at the time flush_scheduled_work is called.

This indeed looks like a bug (thanks for reporting!).

We don't have sufficient information on where the queue release procedure
is by only looking at the queue state: we can't tell whether
rdma_destroy_id was already invoked, so we can deadlock with it.

> This patch fixed the problem but I am not sure if it is the correct fix.
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 9e45cde..d0fb307 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1349,6 +1349,12 @@ static int nvmet_rdma_device_removal(struct rdma_cm_id *cm_id,
>  	spin_lock_irqsave(&queue->state_lock, flags);
>  	if (queue->state != NVMET_RDMA_Q_DISCONNECTING)
>  		queue->state = NVMET_RDMA_IN_DEVICE_REMOVAL;
> +	else {
> +		/* queue is disconnecting; cm_id and queues will be destroyed */
> +		spin_unlock_irqrestore(&queue->state_lock, flags);
> +		return 0;
> +	}
> +
>  	spin_unlock_irqrestore(&queue->state_lock, flags);
>  	nvmet_rdma_queue_disconnect(queue);
>  	flush_scheduled_work();

The problem with your patch is that it introduces a race in the opposite
direction: the queue state does not indicate that the queue release work
has _started_, it only indicates that it was queued.
The DEVICE_REMOVAL event expects that by the end of the callout _all_
associated resources have been freed and the device can safely continue
with its teardown flow.

How about the (untested) alternative below:

---
[PATCH] nvmet-rdma: register ib_client to not deadlock in device removal

We can deadlock if we get a device removal event on a queue which is
already in the process of destroying its cm_id, as this blocks until
all events on the cm_id have drained. On the other hand, we cannot
guarantee that rdma_destroy_id was invoked, as we only have an
indication that the queue disconnect flow has been queued (the queue
state is updated before the release work is queued).

So, we leave all the queue removal to a separate ib_client to avoid
this deadlock, as ib_client device removal runs in a different context
than the cm_id itself.

Signed-off-by: Sagi Grimberg
---
 drivers/nvme/target/rdma.c | 100 ++++++++++++++++++++++++++++++---------------
 1 file changed, 67 insertions(+), 33 deletions(-)

	 * it's own ID. What a great API design..
@@ -1519,9 +1510,51 @@ static struct nvmet_fabrics_ops nvmet_rdma_ops = {
 	.delete_ctrl		= nvmet_rdma_delete_ctrl,
 };
 
+static void nvmet_rdma_add_one(struct ib_device *ib_device)
+{
+}
+
+static void nvmet_rdma_remove_one(struct ib_device *ib_device, void *client_data)
+{
+	struct nvmet_rdma_queue *queue;
+
+	/* Device is being removed, delete all queues using this device */
+	mutex_lock(&nvmet_rdma_queue_mutex);
+	list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
+		if (queue->dev->device != ib_device)
+			continue;
+
+		pr_info("Removing queue %d\n", queue->idx);
+		__nvmet_rdma_queue_disconnect(queue);
+	}
+	mutex_unlock(&nvmet_rdma_queue_mutex);
+
+	flush_scheduled_work();
+}
+
+static struct ib_client nvmet_rdma_ib_client = {
+	.name	= "nvmet_rdma",
+	.add	= nvmet_rdma_add_one,
+	.remove	= nvmet_rdma_remove_one
+};
+
 static int __init nvmet_rdma_init(void)
 {
-	return nvmet_register_transport(&nvmet_rdma_ops);
+	int ret;
+
+	ret = ib_register_client(&nvmet_rdma_ib_client);
+	if (ret)
+		return ret;
+
+	ret = nvmet_register_transport(&nvmet_rdma_ops);
+	if (ret)
+		goto err_ib_client;
+
+	return 0;
+
+err_ib_client:
+	ib_unregister_client(&nvmet_rdma_ib_client);
+	return ret;
 }
 
 static void __exit nvmet_rdma_exit(void)
@@ -1544,6 +1577,7 @@ static void __exit nvmet_rdma_exit(void)
 	mutex_unlock(&nvmet_rdma_queue_mutex);
 
 	flush_scheduled_work();
+	ib_unregister_client(&nvmet_rdma_ib_client);
 	ida_destroy(&nvmet_rdma_queue_ida);
 }
--

Unfortunately, the patch failed to produce a clear diff :/ sorry...
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 32aa10b521c8..56a4cba690b5 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1307,53 +1307,44 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 /**
  * nvme_rdma_device_removal() - Handle RDMA device removal
+ * @cm_id: rdma_cm id, used for nvmet port
  * @queue: nvmet rdma queue (cm id qp_context)
- * @addr: nvmet address (cm_id context)
  *
  * DEVICE_REMOVAL event notifies us that the RDMA device is about
- * to unplug so we should take care of destroying our RDMA resources.
- * This event will be generated for each allocated cm_id.
+ * to unplug. Note that this event can be generated on a normal
+ * queue cm_id and/or a device bound listener cm_id (where in this
+ * case queue will be null).
  *
- * Note that this event can be generated on a normal queue cm_id
- * and/or a device bound listener cm_id (where in this case
- * queue will be null).
- *
- * we claim ownership on destroying the cm_id. For queues we move
- * the queue state to NVMET_RDMA_IN_DEVICE_REMOVAL and for port
+ * We registered an ib_client to handle device removal for queues,
+ * so we only need to handle the listening port cm_ids. In this case
  * we nullify the priv to prevent double cm_id destruction and destroying
  * the cm_id implicitely by returning a non-zero rc to the callout.
  */
 static int nvmet_rdma_device_removal(struct rdma_cm_id *cm_id,
 		struct nvmet_rdma_queue *queue)
 {
-	unsigned long flags;
-
-	if (!queue) {
-		struct nvmet_port *port = cm_id->context;
+	struct nvmet_port *port;
 
+	if (queue) {
 		/*
-		 * This is a listener cm_id. Make sure that
-		 * future remove_port won't invoke a double
-		 * cm_id destroy. use atomic xchg to make sure
-		 * we don't compete with remove_port.
-		 */
-		if (xchg(&port->priv, NULL) != cm_id)
-			return 0;
-	} else {
-		/*
-		 * This is a queue cm_id. Make sure that
-		 * release queue will not destroy the cm_id
-		 * and schedule all ctrl queues removal (only
-		 * if the queue is not disconnecting already).
+		 * This is a queue cm_id. We have registered
+		 * an ib_client to handle queue removal,
+		 * so don't interfere and just return.
 		 */
-		spin_lock_irqsave(&queue->state_lock, flags);
-		if (queue->state != NVMET_RDMA_Q_DISCONNECTING)
-			queue->state = NVMET_RDMA_IN_DEVICE_REMOVAL;
-		spin_unlock_irqrestore(&queue->state_lock, flags);
-		nvmet_rdma_queue_disconnect(queue);
-		flush_scheduled_work();
+		return 0;
 	}
 
+	port = cm_id->context;
+
+	/*
+	 * This is a listener cm_id. Make sure that
+	 * future remove_port won't invoke a double
+	 * cm_id destroy. use atomic xchg to make sure
+	 * we don't compete with remove_port.
+	 */
+	if (xchg(&port->priv, NULL) != cm_id)
+		return 0;
+
 	/*
 	 * We need to return 1 so that the core will destroy