From patchwork Wed May 3 10:05:15 2017
X-Patchwork-Submitter: Marta Rybczynska
X-Patchwork-Id: 9709119
Date: Wed, 3 May 2017 12:05:15 +0200 (CEST)
From: Marta Rybczynska
To: linux-rdma@vger.kernel.org, linux-nvme@lists.infradead.org,
    Sagi Grimberg, Leon Romanovsky, axboe@fb.com, Max Gurtovoy,
    Jason Gunthorpe, Christoph Hellwig, Keith Busch, Doug Ledford,
    Bart Van Assche, samuel jones
Message-ID: <79901165.5342369.1493805915415.JavaMail.zimbra@kalray.eu>
Subject: [PATCH v4, under testing] nvme-rdma: support devices with queue size < 32
In the case of a small NVMe-oF queue size (< 32) we may enter a deadlock:
IB send completions are only signalled once every 32 requests, so with
fewer than 32 queue slots the send queue fills up before any completion
is generated. The error is seen as (using mlx5):

[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

This patch makes the signalling depend on the queue depth: a send is
signalled every queue_depth/2 submissions, and the hardcoded magic value
of 32 is removed completely. The signalling counter is also reworked to
use atomic operations.

Signed-off-by: Marta Rybczynska
Signed-off-by: Samuel Jones [v1]
---
Changes in v4:
* use atomic operations as suggested by Sagi

Changes in v3:
* avoid division in the fast path
* reverse the sig_count logic to simplify the code: it now counts down
  from queue depth/2 to 0
* change sig_count to int to avoid overflows for big queues

Changes in v2:
* signal every queue size/2 requests, remove the hardcoded 32
* support a queue depth of 1

A standalone userspace sketch of the countdown logic is appended after
the diff, for reviewers who want to sanity-check the signalling rate.
---
 drivers/nvme/host/rdma.c | 40 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 16f84eb..234b010 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
 	struct nvme_rdma_qe	*rsp_ring;
-	u8			sig_count;
+	atomic_t		sig_count;
 	int			queue_size;
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
@@ -257,6 +257,15 @@ static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
 	return queue->cm_error;
 }
 
+static inline int nvme_rdma_init_sig_count(int queue_size)
+{
+	/* We signal completion every queue depth/2 and also
+	 * handle the case of possible device with queue_depth=1,
+	 * where we would need to signal every message.
+	 */
+	return max(queue_size / 2, 1);
+}
+
 static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
 {
 	struct nvme_rdma_device *dev = queue->device;
@@ -561,6 +570,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 
 	queue->queue_size = queue_size;
 
+	atomic_set(&queue->sig_count, nvme_rdma_init_sig_count(queue_size));
+
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(queue->cm_id)) {
@@ -1029,6 +1040,28 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
+static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+	int v, old;
+
+	v = atomic_read(&queue->sig_count);
+	while (1) {
+		if (v > 1) {
+			old = atomic_cmpxchg(&queue->sig_count, v, v - 1);
+			if (old == v)
+				return false;
+		} else {
+			int new_count;
+
+			new_count = nvme_rdma_init_sig_count(queue->queue_size);
+			old = atomic_cmpxchg(&queue->sig_count, v, new_count);
+			if (old == v)
+				return true;
+		}
+		v = old;
+	}
+}
+
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
 		struct ib_send_wr *first, bool flush)
@@ -1056,9 +1089,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	 * Would have been way to obvious to handle this in hardware or
 	 * at least the RDMA stack..
 	 *
-	 * This messy and racy code sniplet is copy and pasted from the iSER
-	 * initiator, and the magic '32' comes from there as well.
-	 *
 	 * Always signal the flushes. The magic request used for the flush
 	 * sequencer is not allocated in our driver's tagset and it's
 	 * triggered to be freed by blk_cleanup_queue(). So we need to
@@ -1066,7 +1096,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	 * embedded in request's payload, is not freed when __ib_process_cq()
 	 * calls wr_cqe->done().
 	 */
-	if ((++queue->sig_count % 32) == 0 || flush)
+	if (nvme_rdma_queue_sig_limit(queue) || flush)
 		wr.send_flags |= IB_SEND_SIGNALED;
 
 	if (first)
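
For reviewers who want to sanity-check the signalling rate outside the
kernel, below is a minimal standalone sketch of the same countdown
scheme. It is illustrative only and not part of the patch: C11
<stdatomic.h> stands in for the kernel's atomic_t/atomic_cmpxchg (which
return the old value rather than a success flag), and the helper names
merely mirror the patch for readability.

/* Illustrative userspace model of the sig_count countdown (not part of
 * the patch). C11 atomics approximate the kernel's atomic_t API.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static int init_sig_count(int queue_size)
{
	/* Signal every queue_size/2 sends; a queue of depth 1 must
	 * signal every send, hence the lower bound of 1.
	 */
	return queue_size / 2 > 1 ? queue_size / 2 : 1;
}

/* Returns true when this send should be posted IB_SEND_SIGNALED. */
static bool queue_sig_limit(atomic_int *sig_count, int queue_size)
{
	int v = atomic_load(sig_count);

	for (;;) {
		int expected = v;

		if (v > 1) {
			/* Fast path: just decrement, no signalling. */
			if (atomic_compare_exchange_weak(sig_count, &expected, v - 1))
				return false;
		} else {
			/* Counter exhausted: reload it and signal this send. */
			if (atomic_compare_exchange_weak(sig_count, &expected,
							 init_sig_count(queue_size)))
				return true;
		}
		v = expected;	/* lost the race, retry with the fresh value */
	}
}

int main(void)
{
	int queue_size = 4;
	atomic_int sig_count;
	int signalled = 0;

	atomic_init(&sig_count, init_sig_count(queue_size));

	for (int i = 0; i < 16; i++)
		signalled += queue_sig_limit(&sig_count, queue_size);

	/* queue_size = 4 -> every 2nd send signalled -> prints 8. */
	printf("signalled %d of 16 sends\n", signalled);
	return 0;
}

With queue_size = 4 the sketch signals 8 of 16 sends, i.e. every second
submission, matching the queue_depth/2 policy described above.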