Message ID | 162922469165.515267.14848799693507242987.stgit@manet.1015granger.net (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | [RFC] xprtrdma: Move initial Receive posting | expand |
> The first available moment after cm_send_req where xprtrdma can post > Receives is when the RDMA core reports the QP connection has been > established. This has significant implications to upper layer protocols. It means that the listener cannot safely send the first message on a new connection. This constrains the upper layer protocol to be client/server style, with the active-side ULP first to send. If the CM is changed in the way described here, peer-to-peer protocols will be rendered unusable, as the passive side cannot reliably send the first message since the connection will have no receive posted. The MPA Extensions in RFC6581 discuss this, and support peer-to-peer connection models with the "A" flag (section 9.2) enabling passive-first ULPs. Posting receives and awakening client processing as proposed below does not close this race. A passive-side-first protocol will have already begun to send, regardless of this rearrangement. It's an inherent race and will not interoperate reliably. Why change the CM API? The IB spec is not authoritative on this, and there currently is no bug, right? Tom. On 8/17/2021 2:24 PM, Chuck Lever wrote: > Håkon Bugge points out that rdma_create_qp() is not supposed to > return a QP that is ready for Receives to be posted. It so happens > that ours does that, but the IBTA spec (12.9.7.1) states that a > transition to INIT happens only after REQ has been sent. > > In future kernels, QPs returned from rdma_create_qp() might not be in > a state where posting Receives will succeed. This patch is a > pre-requisite to changing the legacy behavior of rdma_create_qp(). > > The first available moment after cm_send_req where xprtrdma can post > Receives is when the RDMA core reports the QP connection has been > established. > > Note that xprtrdma has posted Receives just after rdma_create_qp() > since 8d4fb8ff427a ("xprtrdma: Fix disconnect regression"). To avoid > regressing 8d4fb8ff427a, xprtrdma needs to ensure that initial > Receive WRs are posted before pending RPCs are awoken. It appears > that the current logic does provide that guarantee. > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- > net/sunrpc/xprtrdma/verbs.c | 12 ++++++------ 1 file changed, 6 > insertions(+), 6 deletions(-) > > diff --git a/net/sunrpc/xprtrdma/verbs.c > b/net/sunrpc/xprtrdma/verbs.c index aaec3c9be8db..87ae62cdea18 > 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ > b/net/sunrpc/xprtrdma/verbs.c @@ -520,12 +520,6 @@ int > rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt) > xprt_clear_connected(xprt); rpcrdma_reset_cwnd(r_xprt); > > - /* Bump the ep's reference count while there are - * outstanding > Receives. - */ - rpcrdma_ep_get(ep); - rpcrdma_post_recvs(r_xprt, 1, > true); - rc = rdma_connect(ep->re_id, &ep->re_remote_cma); if (rc) > goto out; @@ -539,6 +533,12 @@ int rpcrdma_xprt_connect(struct > rpcrdma_xprt *r_xprt) goto out; } > > + /* Bump the ep's reference count while there are + * outstanding > Receives. + */ + rpcrdma_ep_get(ep); + rpcrdma_post_recvs(r_xprt, 1, > true); + rc = rpcrdma_sendctxs_create(r_xprt); if (rc) { rc = > -ENOTCONN; > > >
> On Aug 17, 2021, at 9:32 PM, Tom Talpey <tom@talpey.com> wrote: > >> The first available moment after cm_send_req where xprtrdma can post >> Receives is when the RDMA core reports the QP connection has been >> established. For reference, I should have included a link to Håkon's recent remarks about this issue: https://lore.kernel.org/linux-rdma/A3297641-63BA-4DF1-886A-3620E2A40BA3@oracle.com/ > This has significant implications to upper layer protocols. It > means that the listener cannot safely send the first message on > a new connection. This constrains the upper layer protocol to > be client/server style, with the active-side ULP first to send. The current situation is that the consumer doesn't get control again until it gets RDMA_CM_EVENT_ESTABLISHED, which is a bit late, I agree. We could plumb a callback that is invoked at the "Send REQ" step for the purpose of populating the Receive Queue on the active side. Alternately, xprtrdma's connect worker can explicitly move a new QP to INIT state if it wants to post Receives early. I don't see a hard requirement for rdma_create_qp() to do that for us. > If the CM is changed in the way described here, peer-to-peer > protocols will be rendered unusable, as the passive side cannot > reliably send the first message since the connection will have > no receive posted. The MPA Extensions in RFC6581 discuss this, > and support peer-to-peer connection models with the "A" flag > (section 9.2) enabling passive-first ULPs. At the moment, possibly the only passive-first ULP in the Linux kernel is RDS. > Posting receives and awakening client processing as proposed below > does not close this race. A passive-side-first protocol will have > already begun to send, regardless of this rearrangement. It's an > inherent race and will not interoperate reliably. Sure, but as far as I can tell, RPC-over-RDMA v1 does not have an issue here? > Why change the CM API? The IB spec is not authoritative on this, > and there currently is no bug, right? The minor bug is described in the link above. -- Chuck Lever
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index aaec3c9be8db..87ae62cdea18 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -520,12 +520,6 @@ int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt) xprt_clear_connected(xprt); rpcrdma_reset_cwnd(r_xprt); - /* Bump the ep's reference count while there are - * outstanding Receives. - */ - rpcrdma_ep_get(ep); - rpcrdma_post_recvs(r_xprt, 1, true); - rc = rdma_connect(ep->re_id, &ep->re_remote_cma); if (rc) goto out; @@ -539,6 +533,12 @@ int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt) goto out; } + /* Bump the ep's reference count while there are + * outstanding Receives. + */ + rpcrdma_ep_get(ep); + rpcrdma_post_recvs(r_xprt, 1, true); + rc = rpcrdma_sendctxs_create(r_xprt); if (rc) { rc = -ENOTCONN;
Håkon Bugge points out that rdma_create_qp() is not supposed to return a QP that is ready for Receives to be posted. It so happens that ours does that, but the IBTA spec (12.9.7.1) states that a transition to INIT happens only after REQ has been sent. In future kernels, QPs returned from rdma_create_qp() might not be in a state where posting Receives will succeed. This patch is a pre-requisite to changing the legacy behavior of rdma_create_qp(). The first available moment after cm_send_req where xprtrdma can post Receives is when the RDMA core reports the QP connection has been established. Note that xprtrdma has posted Receives just after rdma_create_qp() since 8d4fb8ff427a ("xprtrdma: Fix disconnect regression"). To avoid regressing 8d4fb8ff427a, xprtrdma needs to ensure that initial Receive WRs are posted before pending RPCs are awoken. It appears that the current logic does provide that guarantee. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- net/sunrpc/xprtrdma/verbs.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)