diff mbox series

[RFC] xprtrdma: Move initial Receive posting

Message ID 162922469165.515267.14848799693507242987.stgit@manet.1015granger.net (mailing list archive)
State Not Applicable
Headers show
Series [RFC] xprtrdma: Move initial Receive posting | expand

Commit Message

Chuck Lever III Aug. 17, 2021, 6:24 p.m. UTC
Håkon Bugge points out that rdma_create_qp() is not supposed to
return a QP that is ready for Receives to be posted. It so happens
that ours does that, but the IBTA spec (12.9.7.1) states that a
transition to INIT happens only after REQ has been sent.

In future kernels, QPs returned from rdma_create_qp() might not
be in a state where posting Receives will succeed. This patch
is a pre-requisite to changing the legacy behavior of
rdma_create_qp().

The first available moment after cm_send_req where xprtrdma can
post Receives is when the RDMA core reports the QP connection
has been established.

Note that xprtrdma has posted Receives just after rdma_create_qp()
since 8d4fb8ff427a ("xprtrdma: Fix disconnect regression"). To avoid
regressing 8d4fb8ff427a, xprtrdma needs to ensure that initial
Receive WRs are posted before pending RPCs are awoken. It appears
that the current logic does provide that guarantee.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Tom Talpey Aug. 18, 2021, 1:32 a.m. UTC | #1
> The first available moment after cm_send_req where xprtrdma can post
> Receives is when the RDMA core reports the QP connection has been
> established.

This has significant implications to upper layer protocols. It
means that the listener cannot safely send the first message on
a new connection. This constrains the upper layer protocol to
be client/server style, with the active-side ULP first to send.

If the CM is changed in the way described here, peer-to-peer
protocols will be rendered unusable, as the passive side cannot
reliably send the first message since the connection will have
no receive posted. The MPA Extensions in RFC6581 discuss this,
and support peer-to-peer connection models with the "A" flag
(section 9.2) enabling passive-first ULPs.

Posting receives and awakening client processing as proposed below
does not close this race. A passive-side-first protocol will have
already begun to send, regardless of this rearrangement. It's an
inherent race and will not interoperate reliably.

Why change the CM API? The IB spec is not authoritative on this,
and there currently is no bug, right?

Tom.

On 8/17/2021 2:24 PM, Chuck Lever wrote:
> Håkon Bugge points out that rdma_create_qp() is not supposed to 
> return a QP that is ready for Receives to be posted. It so happens 
> that ours does that, but the IBTA spec (12.9.7.1) states that a 
> transition to INIT happens only after REQ has been sent.
> 
> In future kernels, QPs returned from rdma_create_qp() might not be in
> a state where posting Receives will succeed. This patch is a
> pre-requisite to changing the legacy behavior of rdma_create_qp().
> 
> The first available moment after cm_send_req where xprtrdma can post
> Receives is when the RDMA core reports the QP connection has been
> established.
> 
> Note that xprtrdma has posted Receives just after rdma_create_qp() 
> since 8d4fb8ff427a ("xprtrdma: Fix disconnect regression"). To avoid 
> regressing 8d4fb8ff427a, xprtrdma needs to ensure that initial 
> Receive WRs are posted before pending RPCs are awoken. It appears 
> that the current logic does provide that guarantee.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- 
> net/sunrpc/xprtrdma/verbs.c |   12 ++++++------ 1 file changed, 6
> insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c
> b/net/sunrpc/xprtrdma/verbs.c index aaec3c9be8db..87ae62cdea18
> 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++
> b/net/sunrpc/xprtrdma/verbs.c @@ -520,12 +520,6 @@ int
> rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt) 
> xprt_clear_connected(xprt); rpcrdma_reset_cwnd(r_xprt);
> 
> -	/* Bump the ep's reference count while there are -	 * outstanding
> Receives. -	 */ -	rpcrdma_ep_get(ep); -	rpcrdma_post_recvs(r_xprt, 1,
> true); - rc = rdma_connect(ep->re_id, &ep->re_remote_cma); if (rc) 
> goto out; @@ -539,6 +533,12 @@ int rpcrdma_xprt_connect(struct
> rpcrdma_xprt *r_xprt) goto out; }
> 
> +	/* Bump the ep's reference count while there are +	 * outstanding
> Receives. +	 */ +	rpcrdma_ep_get(ep); +	rpcrdma_post_recvs(r_xprt, 1,
> true); + rc = rpcrdma_sendctxs_create(r_xprt); if (rc) { rc =
> -ENOTCONN;
> 
> 
>
Chuck Lever III Aug. 18, 2021, 2:32 a.m. UTC | #2
> On Aug 17, 2021, at 9:32 PM, Tom Talpey <tom@talpey.com> wrote:
> 
>> The first available moment after cm_send_req where xprtrdma can post
>> Receives is when the RDMA core reports the QP connection has been
>> established.

For reference, I should have included a link to Håkon's recent
remarks about this issue:

https://lore.kernel.org/linux-rdma/A3297641-63BA-4DF1-886A-3620E2A40BA3@oracle.com/


> This has significant implications to upper layer protocols. It
> means that the listener cannot safely send the first message on
> a new connection. This constrains the upper layer protocol to
> be client/server style, with the active-side ULP first to send.

The current situation is that the consumer doesn't get control
again until it gets RDMA_CM_EVENT_ESTABLISHED, which is a bit
late, I agree. We could plumb a callback that is invoked at
the "Send REQ" step for the purpose of populating the Receive
Queue on the active side.

Alternately, xprtrdma's connect worker can explicitly move a
new QP to INIT state if it wants to post Receives early. I
don't see a hard requirement for rdma_create_qp() to do that
for us.


> If the CM is changed in the way described here, peer-to-peer
> protocols will be rendered unusable, as the passive side cannot
> reliably send the first message since the connection will have
> no receive posted. The MPA Extensions in RFC6581 discuss this,
> and support peer-to-peer connection models with the "A" flag
> (section 9.2) enabling passive-first ULPs.

At the moment, possibly the only passive-first ULP in the
Linux kernel is RDS.


> Posting receives and awakening client processing as proposed below
> does not close this race. A passive-side-first protocol will have
> already begun to send, regardless of this rearrangement. It's an
> inherent race and will not interoperate reliably.

Sure, but as far as I can tell, RPC-over-RDMA v1 does not
have an issue here?


> Why change the CM API? The IB spec is not authoritative on this,
> and there currently is no bug, right?

The minor bug is described in the link above.


--
Chuck Lever
diff mbox series

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aaec3c9be8db..87ae62cdea18 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -520,12 +520,6 @@  int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt)
 	xprt_clear_connected(xprt);
 	rpcrdma_reset_cwnd(r_xprt);
 
-	/* Bump the ep's reference count while there are
-	 * outstanding Receives.
-	 */
-	rpcrdma_ep_get(ep);
-	rpcrdma_post_recvs(r_xprt, 1, true);
-
 	rc = rdma_connect(ep->re_id, &ep->re_remote_cma);
 	if (rc)
 		goto out;
@@ -539,6 +533,12 @@  int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt)
 		goto out;
 	}
 
+	/* Bump the ep's reference count while there are
+	 * outstanding Receives.
+	 */
+	rpcrdma_ep_get(ep);
+	rpcrdma_post_recvs(r_xprt, 1, true);
+
 	rc = rpcrdma_sendctxs_create(r_xprt);
 	if (rc) {
 		rc = -ENOTCONN;