[v2,1/3] nvme-rdma: don't suppress send completions

Message ID 20171108100616.26605-2-sagi@grimberg.me (mailing list archive)
State Not Applicable

Commit Message

Sagi Grimberg Nov. 8, 2017, 10:06 a.m. UTC
The entire completion suppression mechanism is currently
broken because the HCA might retry a send operation
(due to a dropped ack) after the nvme transaction has completed.

In order to handle this, we signal all send completions (besides
the async event, which does not race with anything).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 42 ++++--------------------------------------
 1 file changed, 4 insertions(+), 38 deletions(-)

Comments

Christoph Hellwig Nov. 9, 2017, 9:18 a.m. UTC | #1
On Wed, Nov 08, 2017 at 12:06:14PM +0200, Sagi Grimberg wrote:
> The entire completion suppression mechanism is currently
> broken because the HCA might retry a send operation
> (due to a dropped ack) after the nvme transaction has completed.
> 
> In order to handle this, we signal all send completions (besides
> the async event, which does not race with anything).

Oh well, so much for all these unsignalled completion optimizations..

So in which cases do unsignalled completions work at all?  Seems like
we need to fix up a lot of other ULPs as well.

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> -	 */
> -	if (nvme_rdma_queue_sig_limit(queue) || flush)
> -		wr.send_flags |= IB_SEND_SIGNALED;
> +	wr.send_flags = IB_SEND_SIGNALED;

But..  Is there any benefit in just setting IB_SIGNAL_ALL_WR on the QP?
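For reference, that alternative would look roughly like this -- a sketch
only, loosely following the QP creation in this driver, with sizing
approximate and error handling elided.  The signalling policy is then
chosen once via sq_sig_type instead of per WR:

	struct ib_qp_init_attr init_attr = { };
	int ret;

	init_attr.cap.max_send_wr  = queue->queue_size + 1;
	init_attr.cap.max_recv_wr  = queue->queue_size + 1;
	init_attr.cap.max_send_sge = 1 + NVME_RDMA_MAX_INLINE_SEGMENTS;
	init_attr.cap.max_recv_sge = 1;
	init_attr.send_cq = queue->ib_cq;
	init_attr.recv_cq = queue->ib_cq;
	init_attr.qp_type = IB_QPT_RC;
	/* every send WR generates a completion, no per-WR IB_SEND_SIGNALED */
	init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;

	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
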
Sagi Grimberg Nov. 9, 2017, 11:08 a.m. UTC | #2
>> The entire completion suppression mechanism is currently
>> broken because the HCA might retry a send operation
>> (due to a dropped ack) after the nvme transaction has completed.
>>
>> In order to handle this, we signal all send completions (besides
>> the async event, which does not race with anything).
> 
> Oh well, so much for all these unsignalled completion optimizations..
> 
> So in which cases do unsignalled completions work at all?

Probably in cases where no in-capsule data is used we're fine, because
the command buffer mappings are long-lived, and in non-RPC-like
applications.
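
Roughly speaking (illustration only, not code from this patch, loosely
following the inline-data mapping path): what the send SGEs point at is
what decides whether a late, HCA-retried send can touch recycled memory:

	/* Command capsule: backed by the long-lived per-tag SQE buffer,
	 * so even a send retried after completion re-reads stable memory. */
	sge[0].addr   = qe->dma;
	sge[0].length = sizeof(struct nvme_command);
	sge[0].lkey   = queue->device->pd->local_dma_lkey;

	/* In-capsule (inline) data: points into the request's data pages,
	 * which are unmapped and handed back as soon as the nvme completion
	 * arrives -- possibly while the HCA is still retrying the unacked
	 * send.  This is the case unsignalled completions cannot cover. */
	sge[1].addr   = sg_dma_address(req->sg_table.sgl);
	sge[1].length = blk_rq_payload_bytes(rq);
	sge[1].lkey   = queue->device->pd->local_dma_lkey;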

> Seems like we need to fix up a lot of other ULPs as well.

Probably only those that support in-capsule data (which I hear
SRP does too these days).

>> -	 */
>> -	if (nvme_rdma_queue_sig_limit(queue) || flush)
>> -		wr.send_flags |= IB_SEND_SIGNALED;
>> +	wr.send_flags = IB_SEND_SIGNALED;
> 
> But..  Is there any benefit in just setting IB_SIGNAL_ALL_WR on the QP?

There is one case where I still don't want to signal the send
completion: the AEN request. It doesn't have a request structure, I'd
prefer not to check for it on every send completion, and it doesn't
race with anything (mentioned this above).

I can change that, though, if there is a strong desire.
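
For completeness, keeping that exception would look something like this
(sketch only, not the hunks in this v2; the 'signal' argument is made up
for illustration, and the sge/cqe setup is elided):

static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
		struct ib_send_wr *first, bool signal)
{
	struct ib_send_wr wr, *bad_wr;

	wr.next       = NULL;
	wr.wr_cqe     = &qe->cqe;
	wr.sg_list    = sge;
	wr.num_sge    = num_sge;
	wr.opcode     = IB_WR_SEND;
	/* only the AEN passes signal=false: it has no request structure
	 * and no in-capsule data, so nothing races with a late HCA retry */
	wr.send_flags = signal ? IB_SEND_SIGNALED : 0;

	if (first)
		first->next = &wr;
	else
		first = &wr;

	return ib_post_send(queue->qp, first, &bad_wr);
}
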
Christoph Hellwig Nov. 20, 2017, 8:18 a.m. UTC | #3
On Thu, Nov 09, 2017 at 01:08:04PM +0200, Sagi Grimberg wrote:
> Probably only those that support in-capsule data (which I hear
> SRP does too these days).

SRP-2 adds inline data, and Bart's out-of-tree code has been supporting
that for a long time.

> There is one case where I still don't want to signal the send
> completion: the AEN request. It doesn't have a request structure, I'd
> prefer not to check for it on every send completion, and it doesn't
> race with anything (mentioned this above).
>
> I can change that, though, if there is a strong desire.

I don't really like having a special case just for this slow path
special case.  So if we can avoid it without too much overhead let's
do it, otherwise we can keep it as-is.
Sagi Grimberg Nov. 20, 2017, 8:33 a.m. UTC | #4
>> There is one case where I still don't want to signal the send
>> completion: the AEN request. It doesn't have a request structure, I'd
>> prefer not to check for it on every send completion, and it doesn't
>> race with anything (mentioned this above).
>>
>> I can change that, though, if there is a strong desire.
> 
> I don't really like having a special case just for this slow path
> special case.  So if we can avoid it without too much overhead let's
> do it, otherwise we can keep it as-is.

Saving the completion state of requests adds complication in general,
and we don't even have a request structure for AENs, so it would mean
keeping that state in the queue; and we don't race with anything there
because AENs don't carry inline data. So I think it's simpler to keep
it as is.

Sending a v3 that relaxes the req->lock to spin_lock_bh
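
That is, something along these lines on the completion side
(illustrative only -- req->lock and the state it protects belong to the
other patches in this series, and the field name here is made up):

	/* completion handlers run in softirq context (IB_POLL_SOFTIRQ),
	 * so _bh protection is enough; no need to disable interrupts */
	spin_lock_bh(&req->lock);
	req->send_completed = true;	/* made-up field, for illustration */
	spin_unlock_bh(&req->lock);
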
Christoph Hellwig Nov. 20, 2017, 9:32 a.m. UTC | #5
On Mon, Nov 20, 2017 at 10:33:02AM +0200, Sagi Grimberg wrote:
>> I don't really like having a special case just for this slow path
>> special case.  So if we can avoid it without too much overhead let's
>> do it, otherwise we can keep it as-is.
>
> Saving the completion state of requests adds complication in general,
> and we don't even have a request structure for AENs, so it would mean
> keeping that state in the queue; and we don't race with anything there
> because AENs don't carry inline data. So I think it's simpler to keep
> it as is.

Ok, let's keep it.  But please add a comment explaining why
non-signalled completions are fine for the AER but nothing else.
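
Something along these lines above the AEN submission would be enough
(wording is only a suggestion):

	/*
	 * AEN sends are deliberately left unsignalled: there is no
	 * request structure behind them and no in-capsule data, so a
	 * late, HCA-retried send cannot touch memory that has already
	 * been handed back.  Every other send must be signalled, see
	 * nvme_rdma_post_send().
	 */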

Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ef7d27b63088..c765f1d20141 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -85,7 +85,6 @@  enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
 	struct nvme_rdma_qe	*rsp_ring;
-	atomic_t		sig_count;
 	int			queue_size;
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
@@ -510,7 +509,6 @@  static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
 		queue->cmnd_capsule_len = sizeof(struct nvme_command);
 
 	queue->queue_size = queue_size;
-	atomic_set(&queue->sig_count, 0);
 
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
@@ -1196,21 +1194,9 @@  static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-/*
- * We want to signal completion at least every queue depth/2.  This returns the
- * largest power of two that is not above half of (queue size + 1) to optimize
- * (avoid divisions).
- */
-static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
-{
-	int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-	return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
-}
-
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first, bool flush)
+		struct ib_send_wr *first)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1226,24 +1212,7 @@  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list    = sge;
 	wr.num_sge    = num_sge;
 	wr.opcode     = IB_WR_SEND;
-	wr.send_flags = 0;
-
-	/*
-	 * Unsignalled send completions are another giant desaster in the
-	 * IB Verbs spec:  If we don't regularly post signalled sends
-	 * the send queue will fill up and only a QP reset will rescue us.
-	 * Would have been way to obvious to handle this in hardware or
-	 * at least the RDMA stack..
-	 *
-	 * Always signal the flushes. The magic request used for the flush
-	 * sequencer is not allocated in our driver's tagset and it's
-	 * triggered to be freed by blk_cleanup_queue(). So we need to
-	 * always mark it as signaled to ensure that the "wr_cqe", which is
-	 * embedded in request's payload, is not freed when __ib_process_cq()
-	 * calls wr_cqe->done().
-	 */
-	if (nvme_rdma_queue_sig_limit(queue) || flush)
-		wr.send_flags |= IB_SEND_SIGNALED;
+	wr.send_flags = IB_SEND_SIGNALED;
 
 	if (first)
 		first->next = &wr;
@@ -1317,7 +1286,7 @@  static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
+	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
 	WARN_ON_ONCE(ret);
 }
 
@@ -1602,7 +1571,6 @@  static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_qe *sqe = &req->sqe;
 	struct nvme_command *c = sqe->data;
-	bool flush = false;
 	struct ib_device *dev;
 	blk_status_t ret;
 	int err;
@@ -1634,10 +1602,8 @@  static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	if (req_op(rq) == REQ_OP_FLUSH)
-		flush = true;
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL, flush);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;