[v3,15/26] xprtrdma: Do not recycle MR after FastReg/LocalInv flushes

Message ID 161885539285.38598.13978652738422395833.stgit@manet.1015granger.net (mailing list archive)
State Not Applicable
Series NFS/RDMA client patches for next

Commit Message

Chuck Lever III April 19, 2021, 6:03 p.m. UTC
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.

I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h |    1 -
 net/sunrpc/xprtrdma/frwr_ops.c |   69 ++++++++++------------------------------
 2 files changed, 17 insertions(+), 53 deletions(-)
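
The policy described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name prefixed with model_ is hypothetical and is not part of xprtrdma, and singly linked lists stand in for the kernel's list_head, locking, and DMA state. It shows that an MR whose completion did not succeed is simply not put back on the free list, yet stays reachable through the "all" list and is released at teardown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_mr {
	struct model_mr *next_all;	/* link on the "all MRs" list */
	struct model_mr *next_free;	/* link on the free list */
};

struct model_xprt {
	struct model_mr *all;		/* every MR allocated for this transport */
	struct model_mr *free;		/* MRs currently available for reuse */
};

/* Allocate one MR and track it on the "all" list. */
static struct model_mr *model_mr_alloc(struct model_xprt *x)
{
	struct model_mr *mr = calloc(1, sizeof(*mr));

	if (!mr)
		return NULL;
	mr->next_all = x->all;
	x->all = mr;
	return mr;
}

/* Completion handling: only a successful completion makes the MR
 * reusable. A flushed MR is left untouched; it is still on "all".
 */
static void model_mr_done(struct model_xprt *x, struct model_mr *mr,
			  bool success)
{
	assert(mr != NULL);
	if (!success)
		return;
	mr->next_free = x->free;
	x->free = mr;
}

/* Count the MRs that are available for reuse. */
static size_t model_free_count(const struct model_xprt *x)
{
	const struct model_mr *mr;
	size_t n = 0;

	for (mr = x->free; mr; mr = mr->next_free)
		n++;
	return n;
}

/* Teardown: release every MR, flushed or not. Returns how many
 * MRs were released.
 */
static size_t model_xprt_teardown(struct model_xprt *x)
{
	size_t n = 0;

	while (x->all) {
		struct model_mr *mr = x->all;

		x->all = mr->next_all;
		free(mr);
		n++;
	}
	x->free = NULL;
	return n;
}
```

In this model, an MR passed to model_mr_done() with success set to false never shows up in model_free_count(), but model_xprt_teardown() still releases it, so nothing is leaked past the lifetime of the connection.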

Comments

Dan Aloni April 25, 2021, 2:19 p.m. UTC | #1
On Mon, Apr 19, 2021 at 02:03:12PM -0400, Chuck Lever wrote:
> Better not to touch MRs involved in a flush or post error until the
> Send and Receive Queues are drained and the transport is fully
> quiescent. Simply don't insert such MRs back onto the free list.
> They remain on mr_all and will be released when the connection is
> torn down.
> 
> I had thought that recycling would prevent hardware resources from
> being tied up for a long time. However, since v5.7, a transport
> disconnect destroys the QP and other hardware-owned resources. The
> MRs get cleaned up nicely at that point.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Is this a fix for the crash below?  I just wonder if it appeared for
others in the wild, and the fix is not just theoretical.

    WARNING: CPU: 5 PID: 20312 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
    list_del corruption, ffff9df150b06768->next is LIST_POISON1 (dead000000000100)

    Call Trace:
     [<ffffffff99764147>] dump_stack+0x19/0x1b
     [<ffffffff99098848>] __warn+0xd8/0x100
     [<ffffffff990988cf>] warn_slowpath_fmt+0x5f/0x80
     [<ffffffff9921d5f6>] ? kfree+0x106/0x140
     [<ffffffff99396953>] __list_del_entry+0x63/0xd0
     [<ffffffff993969cd>] list_del+0xd/0x30
     [<ffffffffc0bb307f>] frwr_mr_recycle+0xaf/0x150 [rpcrdma]
     [<ffffffffc0bb3264>] frwr_wc_localinv+0x94/0xa0 [rpcrdma]
     [<ffffffffc067d20e>] __ib_process_cq+0x8e/0x100 [ib_core]
     [<ffffffffc067d2f9>] ib_cq_poll_work+0x29/0x70 [ib_core]
     [<ffffffff990baf9f>] process_one_work+0x17f/0x440
     [<ffffffff990bc036>] worker_thread+0x126/0x3c0
     [<ffffffff990bbf10>] ? manage_workers.isra.25+0x2a0/0x2a0
     [<ffffffff990c2e81>] kthread+0xd1/0xe0
     [<ffffffff990c2db0>] ? insert_kthread_work+0x40/0x40
     [<ffffffff99776c37>] ret_from_fork_nospec_begin+0x21/0x21
     [<ffffffff990c2db0>] ? insert_kthread_work+0x40/0x40

Chuck Lever III April 25, 2021, 4:21 p.m. UTC | #2
> On Apr 25, 2021, at 10:19 AM, Dan Aloni <dan@kernelim.com> wrote:
> 
> On Mon, Apr 19, 2021 at 02:03:12PM -0400, Chuck Lever wrote:
>> Better not to touch MRs involved in a flush or post error until the
>> Send and Receive Queues are drained and the transport is fully
>> quiescent. Simply don't insert such MRs back onto the free list.
>> They remain on mr_all and will be released when the connection is
>> torn down.
>> 
>> I had thought that recycling would prevent hardware resources from
>> being tied up for a long time. However, since v5.7, a transport
>> disconnect destroys the QP and other hardware-owned resources. The
>> MRs get cleaned up nicely at that point.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> Is this a fix for the crash below?

Yes, it is plausible. That is a familiar backtrace.

However, it's usually because the provider called the LocalInv
completion handler twice for the same CQE. Which provider is this?


> I just wonder if it appeared for
> others in the wild, and the fix is not just theoretical.
> 
>    WARNING: CPU: 5 PID: 20312 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
>    list_del corruption, ffff9df150b06768->next is LIST_POISON1 (dead000000000100)
> 
>    Call Trace:
>     [<ffffffff99764147>] dump_stack+0x19/0x1b
>     [<ffffffff99098848>] __warn+0xd8/0x100
>     [<ffffffff990988cf>] warn_slowpath_fmt+0x5f/0x80
>     [<ffffffff9921d5f6>] ? kfree+0x106/0x140
>     [<ffffffff99396953>] __list_del_entry+0x63/0xd0
>     [<ffffffff993969cd>] list_del+0xd/0x30
>     [<ffffffffc0bb307f>] frwr_mr_recycle+0xaf/0x150 [rpcrdma]
>     [<ffffffffc0bb3264>] frwr_wc_localinv+0x94/0xa0 [rpcrdma]
>     [<ffffffffc067d20e>] __ib_process_cq+0x8e/0x100 [ib_core]
>     [<ffffffffc067d2f9>] ib_cq_poll_work+0x29/0x70 [ib_core]
>     [<ffffffff990baf9f>] process_one_work+0x17f/0x440
>     [<ffffffff990bc036>] worker_thread+0x126/0x3c0
>     [<ffffffff990bbf10>] ? manage_workers.isra.25+0x2a0/0x2a0
>     [<ffffffff990c2e81>] kthread+0xd1/0xe0
>     [<ffffffff990c2db0>] ? insert_kthread_work+0x40/0x40
>     [<ffffffff99776c37>] ret_from_fork_nospec_begin+0x21/0x21
>     [<ffffffff990c2db0>] ? insert_kthread_work+0x40/0x40
> 
> -- 
> Dan Aloni

--
Chuck Lever

Dan Aloni April 25, 2021, 5 p.m. UTC | #3
On Sun, Apr 25, 2021 at 04:21:03PM +0000, Chuck Lever III wrote:
> 
> 
> > On Apr 25, 2021, at 10:19 AM, Dan Aloni <dan@kernelim.com> wrote:
> > 
> > On Mon, Apr 19, 2021 at 02:03:12PM -0400, Chuck Lever wrote:
> >> Better not to touch MRs involved in a flush or post error until the
> >> Send and Receive Queues are drained and the transport is fully
> >> quiescent. Simply don't insert such MRs back onto the free list.
> >> They remain on mr_all and will be released when the connection is
> >> torn down.
> >> 
> >> I had thought that recycling would prevent hardware resources from
> >> being tied up for a long time. However, since v5.7, a transport
> >> disconnect destroys the QP and other hardware-owned resources. The
> >> MRs get cleaned up nicely at that point.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Is this a fix for the crash below?
> 
> Yes, it is plausible. That is a familiar backtrace.
> 
> However, it's usually because the provider called the LocalInv
> completion handler twice for the same CQE. Which provider is this?

It's the mlx5 driver, ConnectX-4 (MT27700).
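
For readers following the thread, the kind of failure in the backtrace can be sketched in userspace. The removed frwr_mr_recycle() called list_del() on the MR and then freed it, so running it a second time for the same MR (for example, if a completion handler were invoked twice for one CQE, as suggested above) would call list_del() on an entry that has already been unlinked. The model below is an illustration only: the model_ names are hypothetical, the sentinel values are arbitrary stand-ins for the kernel's list poison constants, and the check returns false where the kernel's list debugging would print a warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a doubly linked list entry. */
struct model_node {
	struct model_node *next, *prev;
};

/* Arbitrary sentinels standing in for the kernel's poison values. */
#define MODEL_POISON1 ((struct model_node *)(uintptr_t)0x100)
#define MODEL_POISON2 ((struct model_node *)(uintptr_t)0x122)

/* Make an empty circular list. */
static void model_list_init(struct model_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert n right after head. */
static void model_list_add(struct model_node *n, struct model_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n and poison its pointers. Returns false, without touching
 * the list, if n already looks deleted.
 */
static bool model_list_del(struct model_node *n)
{
	if (n->next == MODEL_POISON1 || n->prev == MODEL_POISON2)
		return false;	/* second delete of the same entry */

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = MODEL_POISON1;
	n->prev = MODEL_POISON2;
	return true;
}
```

The first model_list_del() on an entry unlinks it and poisons its pointers; a second call on the same entry is caught by the poison check instead of corrupting the neighbouring entries. With the patch applied, the completion path no longer unlinks or frees an MR at all, so a duplicated completion would have nothing to delete twice.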

Patch

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index c838e7ac1c2d..e38e745d13b0 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -1014,7 +1014,6 @@  DEFINE_MR_EVENT(localinv);
 DEFINE_MR_EVENT(map);
 
 DEFINE_ANON_MR_EVENT(unmap);
-DEFINE_ANON_MR_EVENT(recycle);
 
 TRACE_EVENT(xprtrdma_dma_maperr,
 	TP_PROTO(
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index af85cec0ce31..27087dc8ba3c 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -49,6 +49,16 @@ 
 # define RPCDBG_FACILITY	RPCDBG_TRANS
 #endif
 
+static void frwr_mr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
+{
+	if (mr->mr_device) {
+		trace_xprtrdma_mr_unmap(mr);
+		ib_dma_unmap_sg(mr->mr_device, mr->mr_sg, mr->mr_nents,
+				mr->mr_dir);
+		mr->mr_device = NULL;
+	}
+}
+
 /**
  * frwr_mr_release - Destroy one MR
  * @mr: MR allocated by frwr_mr_init
@@ -58,6 +68,8 @@  void frwr_mr_release(struct rpcrdma_mr *mr)
 {
 	int rc;
 
+	frwr_mr_unmap(mr->mr_xprt, mr);
+
 	rc = ib_dereg_mr(mr->frwr.fr_mr);
 	if (rc)
 		trace_xprtrdma_frwr_dereg(mr, rc);
@@ -65,32 +77,6 @@  void frwr_mr_release(struct rpcrdma_mr *mr)
 	kfree(mr);
 }
 
-static void frwr_mr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
-{
-	if (mr->mr_device) {
-		trace_xprtrdma_mr_unmap(mr);
-		ib_dma_unmap_sg(mr->mr_device, mr->mr_sg, mr->mr_nents,
-				mr->mr_dir);
-		mr->mr_device = NULL;
-	}
-}
-
-static void frwr_mr_recycle(struct rpcrdma_mr *mr)
-{
-	struct rpcrdma_xprt *r_xprt = mr->mr_xprt;
-
-	trace_xprtrdma_mr_recycle(mr);
-
-	frwr_mr_unmap(r_xprt, mr);
-
-	spin_lock(&r_xprt->rx_buf.rb_lock);
-	list_del(&mr->mr_all);
-	r_xprt->rx_stats.mrs_recycled++;
-	spin_unlock(&r_xprt->rx_buf.rb_lock);
-
-	frwr_mr_release(mr);
-}
-
 static void frwr_mr_put(struct rpcrdma_mr *mr)
 {
 	frwr_mr_unmap(mr->mr_xprt, mr);
@@ -365,6 +351,7 @@  struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
  * @cq: completion queue
  * @wc: WCE for a completed FastReg WR
  *
+ * Each flushed MR gets destroyed after the QP has drained.
  */
 static void frwr_wc_fastreg(struct ib_cq *cq, struct ib_wc *wc)
 {
@@ -374,7 +361,6 @@  static void frwr_wc_fastreg(struct ib_cq *cq, struct ib_wc *wc)
 
 	/* WARNING: Only wr_cqe and status are reliable at this point */
 	trace_xprtrdma_wc_fastreg(wc, &frwr->fr_cid);
-	/* The MR will get recycled when the associated req is retransmitted */
 
 	rpcrdma_flush_disconnect(cq->cq_context, wc);
 }
@@ -448,9 +434,7 @@  void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs)
 
 static void frwr_mr_done(struct ib_wc *wc, struct rpcrdma_mr *mr)
 {
-	if (wc->status != IB_WC_SUCCESS)
-		frwr_mr_recycle(mr);
-	else
+	if (likely(wc->status == IB_WC_SUCCESS))
 		frwr_mr_put(mr);
 }
 
@@ -567,17 +551,8 @@  void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
 	if (!rc)
 		return;
 
-	/* Recycle MRs in the LOCAL_INV chain that did not get posted.
-	 */
+	/* On error, the MRs get destroyed once the QP has drained. */
 	trace_xprtrdma_post_linv_err(req, rc);
-	while (bad_wr) {
-		frwr = container_of(bad_wr, struct rpcrdma_frwr,
-				    fr_invwr);
-		mr = container_of(frwr, struct rpcrdma_mr, frwr);
-		bad_wr = bad_wr->next;
-
-		frwr_mr_recycle(mr);
-	}
 }
 
 /**
@@ -621,7 +596,6 @@  void frwr_unmap_async(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
 {
 	struct ib_send_wr *first, *last, **prev;
 	struct rpcrdma_ep *ep = r_xprt->rx_ep;
-	const struct ib_send_wr *bad_wr;
 	struct rpcrdma_frwr *frwr;
 	struct rpcrdma_mr *mr;
 	int rc;
@@ -663,21 +637,12 @@  void frwr_unmap_async(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
 	 * replaces the QP. The RPC reply handler won't call us
 	 * unless re_id->qp is a valid pointer.
 	 */
-	bad_wr = NULL;
-	rc = ib_post_send(ep->re_id->qp, first, &bad_wr);
+	rc = ib_post_send(ep->re_id->qp, first, NULL);
 	if (!rc)
 		return;
 
-	/* Recycle MRs in the LOCAL_INV chain that did not get posted.
-	 */
+	/* On error, the MRs get destroyed once the QP has drained. */
 	trace_xprtrdma_post_linv_err(req, rc);
-	while (bad_wr) {
-		frwr = container_of(bad_wr, struct rpcrdma_frwr, fr_invwr);
-		mr = container_of(frwr, struct rpcrdma_mr, frwr);
-		bad_wr = bad_wr->next;
-
-		frwr_mr_recycle(mr);
-	}
 
 	/* The final LOCAL_INV WR in the chain is supposed to
 	 * do the wake. If it was never posted, the wake will