xprtrdma: Fix disconnect regression
diff mbox series

Message ID 20180728144051.3612.82362.stgit@manet.1015granger.net
State New
Headers show
Series
  • xprtrdma: Fix disconnect regression
Related show

Commit Message

Chuck Lever July 28, 2018, 2:46 p.m. UTC
I found that injecting disconnects with v4.18-rc resulted in
random failures of the multi-threaded git regression test.

The root cause appears to be that, after a reconnect, the
RPC/RDMA transport is waking pending RPCs before the transport has
posted enough Receive buffers to receive the Replies. If a Reply
arrives before enough Receive buffers are posted, the connection
is dropped. A few connection drops happen in quick succession as
the client and server struggle to regain credit synchronization.

This regression was introduced with commit 7c8d9e7c8863 ("xprtrdma:
Move Receive posting to Receive handler"). The client is supposed to
post a single Receive when a connection is established because
it's not supposed to send more than one RPC Call before it gets
a fresh credit grant in the first RPC Reply [RFC 8166, Section
3.3.3].

Unfortunately there appears to be a longstanding bug in the Linux
client's credit accounting mechanism. On connect, it simply dumps
all pending RPC Calls onto the new connection. It's possible it has
done this ever since the RPC/RDMA transport was added to the kernel
ten years ago.

Servers have so far been tolerant of this bad behavior. Currently no
server implementation ever changes its credit grant over reconnects,
and servers always repost enough Receives before connections are
fully established.

The Linux client implementation used to post a Receive before each
of these Calls. This has covered up the flooding send behavior.

I could try to correct this old bug so that the client sends exactly
one RPC Call and waits for a Reply. Since we are so close to the
next merge window, I'm going to instead provide a simple patch to
post enough Receives before a reconnect completes (based on the
number of credits granted to the previous connection).

The spurious disconnects will be gone, but the client will still
send multiple RPC Calls immediately after a reconnect.

Addressing the latter problem will wait for a merge window because
a) I expect it to be a large change requiring lots of testing, and
b) obviously the Linux client has interoperated successfully since
day zero while still being broken.

Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Hi Anna-

Would it be possible to get this into linux-next and then v4.18-rc ?
I know this is very late, but I found this issue only recently and
it took a few days to nail down a simple fix.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Chuck Lever Aug. 6, 2018, 2 p.m. UTC | #1
> On Jul 28, 2018, at 10:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> I found that injecting disconnects with v4.18-rc resulted in
> random failures of the multi-threaded git regression test.
> 
> The root cause appears to be that, after a reconnect, the
> RPC/RDMA transport is waking pending RPCs before the transport has
> posted enough Receive buffers to receive the Replies. If a Reply
> arrives before enough Receive buffers are posted, the connection
> is dropped. A few connection drops happen in quick succession as
> the client and server struggle to regain credit synchronization.
> 
> This regression was introduced with commit 7c8d9e7c8863 ("xprtrdma:
> Move Receive posting to Receive handler"). The client is supposed to
> post a single Receive when a connection is established because
> it's not supposed to send more than one RPC Call before it gets
> a fresh credit grant in the first RPC Reply [RFC 8166, Section
> 3.3.3].
> 
> Unfortunately there appears to be a longstanding bug in the Linux
> client's credit accounting mechanism. On connect, it simply dumps
> all pending RPC Calls onto the new connection. It's possible it has
> done this ever since the RPC/RDMA transport was added to the kernel
> ten years ago.
> 
> Servers have so far been tolerant of this bad behavior. Currently no
> server implementation ever changes its credit grant over reconnects,
> and servers always repost enough Receives before connections are
> fully established.
> 
> The Linux client implementation used to post a Receive before each
> of these Calls. This has covered up the flooding send behavior.
> 
> I could try to correct this old bug so that the client sends exactly
> one RPC Call and waits for a Reply. Since we are so close to the
> next merge window, I'm going to instead provide a simple patch to
> post enough Receives before a reconnect completes (based on the
> number of credits granted to the previous connection).
> 
> The spurious disconnects will be gone, but the client will still
> send multiple RPC Calls immediately after a reconnect.
> 
> Addressing the latter problem will wait for a merge window because
> a) I expect it to be a large change requiring lots of testing, and
> b) obviously the Linux client has interoperated successfully since
> day zero while still being broken.
> 
> Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/xprtrdma/verbs.c |    5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
> 
> Hi Anna-
> 
> Would it be possible to get this into linux-next and then v4.18-rc ?
> I know this is very late, but I found this issue only recently and
> it took a few days to nail down a simple fix.

Hi Anna, what's the status of this fix?


> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 16161a3..e8d1024 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -280,7 +280,6 @@
> 		++xprt->rx_xprt.connect_cookie;
> 		connstate = -ECONNABORTED;
> connected:
> -		xprt->rx_buf.rb_credits = 1;
> 		ep->rep_connected = connstate;
> 		rpcrdma_conn_func(ep);
> 		wake_up_all(&ep->rep_connect_wait);
> @@ -755,6 +754,7 @@
> 	}
> 
> 	ep->rep_connected = 0;
> +	rpcrdma_post_recvs(r_xprt, true);
> 
> 	rc = rdma_connect(ia->ri_id, &ep->rep_remote_cma);
> 	if (rc) {
> @@ -773,8 +773,6 @@
> 
> 	dprintk("RPC:       %s: connected\n", __func__);
> 
> -	rpcrdma_post_recvs(r_xprt, true);
> -
> out:
> 	if (rc)
> 		ep->rep_connected = rc;
> @@ -1171,6 +1169,7 @@ struct rpcrdma_req *
> 		list_add(&req->rl_list, &buf->rb_send_bufs);
> 	}
> 
> +	buf->rb_credits = 1;
> 	buf->rb_posted_receives = 0;
> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Schumaker, Anna Aug. 6, 2018, 3:22 p.m. UTC | #2
Hi Chuck,

On 08/06/2018 10:00 AM, Chuck Lever wrote:
> 
> 
>> On Jul 28, 2018, at 10:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> I found that injecting disconnects with v4.18-rc resulted in
>> random failures of the multi-threaded git regression test.
>>
>> The root cause appears to be that, after a reconnect, the
>> RPC/RDMA transport is waking pending RPCs before the transport has
>> posted enough Receive buffers to receive the Replies. If a Reply
>> arrives before enough Receive buffers are posted, the connection
>> is dropped. A few connection drops happen in quick succession as
>> the client and server struggle to regain credit synchronization.
>>
>> This regression was introduced with commit 7c8d9e7c8863 ("xprtrdma:
>> Move Receive posting to Receive handler"). The client is supposed to
>> post a single Receive when a connection is established because
>> it's not supposed to send more than one RPC Call before it gets
>> a fresh credit grant in the first RPC Reply [RFC 8166, Section
>> 3.3.3].
>>
>> Unfortunately there appears to be a longstanding bug in the Linux
>> client's credit accounting mechanism. On connect, it simply dumps
>> all pending RPC Calls onto the new connection. It's possible it has
>> done this ever since the RPC/RDMA transport was added to the kernel
>> ten years ago.
>>
>> Servers have so far been tolerant of this bad behavior. Currently no
>> server implementation ever changes its credit grant over reconnects,
>> and servers always repost enough Receives before connections are
>> fully established.
>>
>> The Linux client implementation used to post a Receive before each
>> of these Calls. This has covered up the flooding send behavior.
>>
>> I could try to correct this old bug so that the client sends exactly
>> one RPC Call and waits for a Reply. Since we are so close to the
>> next merge window, I'm going to instead provide a simple patch to
>> post enough Receives before a reconnect completes (based on the
>> number of credits granted to the previous connection).
>>
>> The spurious disconnects will be gone, but the client will still
>> send multiple RPC Calls immediately after a reconnect.
>>
>> Addressing the latter problem will wait for a merge window because
>> a) I expect it to be a large change requiring lots of testing, and
>> b) obviously the Linux client has interoperated successfully since
>> day zero while still being broken.
>>
>> Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xprtrdma/verbs.c |    5 ++---
>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> Hi Anna-
>>
>> Would it be possible to get this into linux-next and then v4.18-rc ?
>> I know this is very late, but I found this issue only recently and
>> it took a few days to nail down a simple fix.
> 
> Hi Anna, what's the status of this fix?

I have it queued up for 4.19 right now, with a cc:stable so it'll be backported once it's added.  Sorry that I didn't see your note about getting it into v4.18 earlier :( 

Anna

> 
> 
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 16161a3..e8d1024 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -280,7 +280,6 @@
>> 		++xprt->rx_xprt.connect_cookie;
>> 		connstate = -ECONNABORTED;
>> connected:
>> -		xprt->rx_buf.rb_credits = 1;
>> 		ep->rep_connected = connstate;
>> 		rpcrdma_conn_func(ep);
>> 		wake_up_all(&ep->rep_connect_wait);
>> @@ -755,6 +754,7 @@
>> 	}
>>
>> 	ep->rep_connected = 0;
>> +	rpcrdma_post_recvs(r_xprt, true);
>>
>> 	rc = rdma_connect(ia->ri_id, &ep->rep_remote_cma);
>> 	if (rc) {
>> @@ -773,8 +773,6 @@
>>
>> 	dprintk("RPC:       %s: connected\n", __func__);
>>
>> -	rpcrdma_post_recvs(r_xprt, true);
>> -
>> out:
>> 	if (rc)
>> 		ep->rep_connected = rc;
>> @@ -1171,6 +1169,7 @@ struct rpcrdma_req *
>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>> 	}
>>
>> +	buf->rb_credits = 1;
>> 	buf->rb_posted_receives = 0;
>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox series

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 16161a3..e8d1024 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -280,7 +280,6 @@ 
 		++xprt->rx_xprt.connect_cookie;
 		connstate = -ECONNABORTED;
 connected:
-		xprt->rx_buf.rb_credits = 1;
 		ep->rep_connected = connstate;
 		rpcrdma_conn_func(ep);
 		wake_up_all(&ep->rep_connect_wait);
@@ -755,6 +754,7 @@ 
 	}
 
 	ep->rep_connected = 0;
+	rpcrdma_post_recvs(r_xprt, true);
 
 	rc = rdma_connect(ia->ri_id, &ep->rep_remote_cma);
 	if (rc) {
@@ -773,8 +773,6 @@ 
 
 	dprintk("RPC:       %s: connected\n", __func__);
 
-	rpcrdma_post_recvs(r_xprt, true);
-
 out:
 	if (rc)
 		ep->rep_connected = rc;
@@ -1171,6 +1169,7 @@  struct rpcrdma_req *
 		list_add(&req->rl_list, &buf->rb_send_bufs);
 	}
 
+	buf->rb_credits = 1;
 	buf->rb_posted_receives = 0;
 	INIT_LIST_HEAD(&buf->rb_recv_bufs);