From patchwork Wed Feb 20 14:57:35 2019
X-Patchwork-Submitter: Yishai Hadas <yishaih@mellanox.com>
X-Patchwork-Id: 10822271
From: Yishai Hadas <yishaih@mellanox.com>
To: linux-rdma@vger.kernel.org
Cc: yishaih@mellanox.com, monis@mellanox.com, artemyko@mellanox.com,
    jgg@mellanox.com, majd@mellanox.com
Subject: [PATCH rdma-core 3/6] mlx5: Introduce a wait queue for SRQ WQEs
Date: Wed, 20 Feb 2019 16:57:35 +0200
Message-Id: <1550674658-13295-4-git-send-email-yishaih@mellanox.com>
In-Reply-To: <1550674658-13295-1-git-send-email-yishaih@mellanox.com>
References: <1550674658-13295-1-git-send-email-yishaih@mellanox.com>

From: Moni Shoua <monis@mellanox.com>

When allocating the WQE buffer, try to allocate more space than requested.
The extra space serves as a wait queue: a place where WQEs that were
recently switched from HW ownership back to SW ownership can cool down
before being posted again. This is useful for WQEs with ODP buffers that
were consumed by HW but have not yet been handled by the page-fault
handler in the kernel.

The wait queue is FIFO: a WQE leaves it only after N-1 other WQEs have
entered, where N is the size of the wait queue. WQEs in the wait queue
are in SW ownership, but they are not candidates for posting. Putting a
WQE into the wait queue therefore takes it out of circulation, so another
WQE must be taken out of the wait queue to replace it.

A wait queue is not mandatory. If the extra resources it requires exceed
the device limits, the SRQ operates without one.
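To make the resulting buffer layout concrete, here is a minimal
stand-alone sketch (not part of the patch; round_up_pow2() and
struct srq_layout are stand-ins for mlx5_round_up_power_of_two() and
the index fields this patch adds to struct mlx5_srq):

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in for mlx5_round_up_power_of_two(). */
    static uint32_t round_up_pow2(uint32_t v)
    {
            uint32_t r = 1;

            while (r < v)
                    r <<= 1;
            return r;
    }

    /* Hypothetical mirror of the SRQ index fields touched by this patch. */
    struct srq_layout {
            uint32_t max;       /* total WQEs allocated, power of two */
            int head, tail;     /* free list, available for posting */
            int waitq_head;     /* -1 when there is no wait queue */
            int waitq_tail;
    };

    /* Split the WQE buffer the way mlx5_alloc_srq_buf() below does:
     * request 2 * max_wr + 1 WQEs so the upper part can serve as the
     * wait queue, falling back to max_wr + 1 when the device limit is
     * too small for that.
     */
    static void srq_split(struct srq_layout *s, uint32_t max_wr,
                          uint32_t dev_max_srq_recv_wr)
    {
            uint32_t nwqe = max_wr * 2 + 1;
            bool have_wq = true;

            if (nwqe > dev_max_srq_recv_wr) {
                    nwqe = max_wr + 1;
                    have_wq = false;
            }

            s->max = round_up_pow2(nwqe);
            s->head = 0;
            s->tail = round_up_pow2(max_wr + 1) - 1;
            if (have_wq) {
                    s->waitq_head = s->tail + 1; /* above the free list */
                    s->waitq_tail = s->max - 1;
            } else {
                    s->waitq_head = -1;          /* srq_has_waitq() tests this */
                    s->waitq_tail = -1;
            }
    }

For example, max_wr = 100 against a large device limit gives max = 256,
with WQEs 0..127 on the free list and WQEs 128..255 in the wait queue.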
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
---
 providers/mlx5/mlx5.h  | 15 +++++++++++-
 providers/mlx5/srq.c   | 63 +++++++++++++++++++++++++++++++++++++++-----------
 providers/mlx5/verbs.c | 33 ++++++++++++++++++--------
 3 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index 75d599a..f315f63 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -415,6 +415,8 @@ struct mlx5_srq {
 	int			wqe_shift;
 	int			head;
 	int			tail;
+	int			waitq_head;
+	int			waitq_tail;
 	__be32		       *db;
 	uint16_t		counter;
 	int			wq_sig;
@@ -807,7 +809,8 @@ int mlx5_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr,
 int mlx5_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr);
 int mlx5_destroy_srq(struct ibv_srq *srq);
 
-int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq);
+int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq,
+		       uint32_t nwr);
 void mlx5_free_srq_wqe(struct mlx5_srq *srq, int ind);
 int mlx5_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr,
@@ -1017,4 +1020,14 @@ static inline uint8_t calc_sig(void *wqe, int size)
 	return ~res;
 }
 
+static inline int align_queue_size(long long req)
+{
+	return mlx5_round_up_power_of_two(req);
+}
+
+static inline bool srq_has_waitq(struct mlx5_srq *srq)
+{
+	return srq->waitq_head >= 0;
+}
+
 #endif /* MLX5_H */
diff --git a/providers/mlx5/srq.c b/providers/mlx5/srq.c
index 94528bb..a2d37d0 100644
--- a/providers/mlx5/srq.c
+++ b/providers/mlx5/srq.c
@@ -145,13 +145,29 @@ int mlx5_post_srq_recv(struct ibv_srq *ibsrq,
 	return err;
 }
 
-int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq)
+/* Build a linked list on an array of SRQ WQEs.
+ * Since WQEs are always added to the tail and taken from the head
+ * it doesn't matter where the last WQE points to.
+ */
+static void set_srq_buf_ll(struct mlx5_srq *srq, int start, int end)
 {
 	struct mlx5_wqe_srq_next_seg *next;
+	int i;
+
+	for (i = start; i < end; ++i) {
+		next = get_wqe(srq, i);
+		next->next_wqe_index = htobe16(i + 1);
+	}
+}
+
+int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq,
+		       uint32_t max_wr)
+{
 	int size;
 	int buf_size;
-	int i;
 	struct mlx5_context *ctx;
+	uint32_t orig_max_wr = max_wr;
+	bool have_wq = true;
 
 	ctx = to_mctx(context);
 
@@ -160,9 +176,18 @@ int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq)
 		return -1;
 	}
 
-	srq->wrid = malloc(srq->max * sizeof *srq->wrid);
-	if (!srq->wrid)
-		return -1;
+	/* At first, try to allocate more WQEs than requested so the extra will
+	 * be used for the wait queue.
+	 */
+	max_wr = orig_max_wr * 2 + 1;
+
+	if (max_wr > ctx->max_srq_recv_wr) {
+		/* Device limits are smaller than required
+		 * to provide a wait queue, continue without.
+		 */
+		max_wr = orig_max_wr + 1;
+		have_wq = false;
+	}
 
 	size = sizeof(struct mlx5_wqe_srq_next_seg) +
 		srq->max_gs * sizeof(struct mlx5_wqe_data_seg);
@@ -179,14 +204,28 @@ int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq)
 
 	srq->wqe_shift = mlx5_ilog2(size);
 
+	srq->max = align_queue_size(max_wr);
 	buf_size = srq->max * size;
 
 	if (mlx5_alloc_buf(&srq->buf, buf_size,
-			   to_mdev(context->device)->page_size)) {
-		free(srq->wrid);
+			   to_mdev(context->device)->page_size))
 		return -1;
+
+	srq->head = 0;
+	srq->tail = align_queue_size(orig_max_wr + 1) - 1;
+	if (have_wq) {
+		srq->waitq_head = srq->tail + 1;
+		srq->waitq_tail = srq->max - 1;
+	} else {
+		srq->waitq_head = -1;
+		srq->waitq_tail = -1;
 	}
 
+	srq->wrid = malloc(srq->max * sizeof(*srq->wrid));
+	if (!srq->wrid) {
+		mlx5_free_buf(&srq->buf);
+		return -1;
+	}
+
 	memset(srq->buf.buf, 0, buf_size);
 
 	/*
@@ -194,13 +233,9 @@ int mlx5_alloc_srq_buf(struct ibv_context *context, struct mlx5_srq *srq)
 	 * linked into the list of free WQEs.
 	 */
 
-	for (i = 0; i < srq->max; ++i) {
-		next = get_wqe(srq, i);
-		next->next_wqe_index = htobe16((i + 1) & (srq->max - 1));
-	}
-
-	srq->head = 0;
-	srq->tail = srq->max - 1;
+	set_srq_buf_ll(srq, srq->head, srq->tail);
+	if (have_wq)
+		set_srq_buf_ll(srq, srq->waitq_head, srq->waitq_tail);
 
 	return 0;
 }
diff --git a/providers/mlx5/verbs.c b/providers/mlx5/verbs.c
index 7e1c125..2bccdf8 100644
--- a/providers/mlx5/verbs.c
+++ b/providers/mlx5/verbs.c
@@ -553,11 +553,6 @@ int mlx5_round_up_power_of_two(long long sz)
 	return (int)ret;
 }
 
-static int align_queue_size(long long req)
-{
-	return mlx5_round_up_power_of_two(req);
-}
-
 static int get_cqe_size(struct mlx5dv_cq_init_attr *mlx5cq_attr)
 {
 	char *env;
@@ -1016,11 +1011,10 @@ struct ibv_srq *mlx5_create_srq(struct ibv_pd *pd,
 		goto err;
 	}
 
-	srq->max = align_queue_size(attr->attr.max_wr + 1);
 	srq->max_gs  = attr->attr.max_sge;
 	srq->counter = 0;
 
-	if (mlx5_alloc_srq_buf(pd->context, srq)) {
+	if (mlx5_alloc_srq_buf(pd->context, srq, attr->attr.max_wr)) {
 		fprintf(stderr, "%s-%d:\n", __func__, __LINE__);
 		goto err;
 	}
@@ -1041,11 +1035,22 @@ struct ibv_srq *mlx5_create_srq(struct ibv_pd *pd,
 	attr->attr.max_sge = srq->max_gs;
 	pthread_mutex_lock(&ctx->srq_table_mutex);
+
+	/* Override max_wr to let kernel know about extra WQEs for the
+	 * wait queue.
+	 */
+	attr->attr.max_wr = srq->max - 1;
+
 	ret = ibv_cmd_create_srq(pd, ibsrq, attr, &cmd.ibv_cmd, sizeof(cmd),
 				 &resp.ibv_resp, sizeof(resp));
 	if (ret)
 		goto err_db;
 
+	/* Override kernel response that includes the wait queue with the real
+	 * number of WQEs that are applicable for the application.
+	 */
+	attr->attr.max_wr = srq->tail;
+
 	ret = mlx5_store_srq(ctx, resp.srqn, srq);
 	if (ret)
 		goto err_destroy;
@@ -2707,11 +2712,10 @@ struct ibv_srq *mlx5_create_srq_ex(struct ibv_context *context,
 		goto err;
 	}
 
-	msrq->max = align_queue_size(attr->attr.max_wr + 1);
 	msrq->max_gs  = attr->attr.max_sge;
 	msrq->counter = 0;
 
-	if (mlx5_alloc_srq_buf(context, msrq)) {
+	if (mlx5_alloc_srq_buf(context, msrq, attr->attr.max_wr)) {
 		fprintf(stderr, "%s-%d:\n", __func__, __LINE__);
 		goto err;
 	}
@@ -2743,9 +2747,20 @@ struct ibv_srq *mlx5_create_srq_ex(struct ibv_context *context,
 		pthread_mutex_lock(&ctx->srq_table_mutex);
 	}
 
+	/* Override max_wr to let kernel know about extra WQEs for the
+	 * wait queue.
+	 */
+	attr->attr.max_wr = msrq->max - 1;
+
 	err = ibv_cmd_create_srq_ex(context, &msrq->vsrq, sizeof(msrq->vsrq),
 				    attr, &cmd.ibv_cmd, sizeof(cmd),
 				    &resp.ibv_resp, sizeof(resp));
+
+	/* Override kernel response that includes the wait queue with the real
+	 * number of WQEs that are applicable for the application.
+	 */
+	attr->attr.max_wr = msrq->tail;
+
 	if (err)
 		goto err_free_uidx;
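The replacement policy from the commit message (a released WQE enters
the wait queue, the longest-waiting WQE leaves it) is not implemented in
this patch; it lands later in the series. Purely for illustration, a
hypothetical sketch reusing struct srq_layout from above, where next[]
models the next_wqe_index links of the WQE array, could look like:

    /* Hypothetical FIFO cool-down step: append the WQE just released
     * by HW at the wait queue tail and move the WQE that has waited
     * longest from the wait queue head to the free list. Queue sizes
     * stay constant, so a WQE becomes postable again only after N - 1
     * others have entered the wait queue, N being its size.
     */
    static int srq_waitq_swap(struct srq_layout *s, uint16_t *next,
                              int released_wqe)
    {
            int recycled = s->waitq_head;       /* longest-waiting WQE */

            s->waitq_head = next[recycled];     /* pop from wait queue head */

            next[s->waitq_tail] = released_wqe; /* push released WQE at tail */
            s->waitq_tail = released_wqe;

            next[s->tail] = recycled;           /* recycled WQE rejoins free list */
            s->tail = recycled;

            return recycled;
    }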