lockd: don't use interval-based rebinding over TCP

Message ID	20201028201627.23625-1-calum.mackay@oracle.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=eJj9=EE=vger.kernel.org=linux-nfs-owner@kernel.org> From: Calum Mackay <calum.mackay@oracle.com> To: trondmy@hammerspace.com, anna.schumaker@netapp.com Cc: linux-nfs@vger.kernel.org Subject: [PATCH] lockd: don't use interval-based rebinding over TCP Date: Wed, 28 Oct 2020 20:16:27 +0000 Message-Id: <20201028201627.23625-1-calum.mackay@oracle.com> Precedence: bulk
Series	lockd: don't use interval-based rebinding over TCP \| expand lockd: don't use interval-based rebinding over TCP

Message ID

20201028201627.23625-1-calum.mackay@oracle.com (mailing list archive)

State

New, archived

Headers

From: Calum Mackay <calum.mackay@oracle.com>
To: trondmy@hammerspace.com, anna.schumaker@netapp.com
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH] lockd: don't use interval-based rebinding over TCP
Date: Wed, 28 Oct 2020 20:16:27 +0000
Message-Id: <20201028201627.23625-1-calum.mackay@oracle.com>
Precedence: bulk

Series

lockd: don't use interval-based rebinding over TCP | expand

Commit Message

Calum Mackay Oct. 28, 2020, 8:16 p.m. UTC

NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.

This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.

It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().

When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:

	xs_tcp_setup_socket: connect returned unhandled error -107

Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.

The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.

For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.

To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.

Drop the dprintk, which adds no value over a trace.

Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
---
 fs/lockd/host.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

Comments

Calum Mackay Nov. 12, 2020, 9:04 p.m. UTC | #1

hi Anna, Trond,

was there any comment on this, please?

thanks,
calum.

On 28/10/2020 8:16 pm, Calum Mackay wrote:
> NLM uses an interval-based rebinding, i.e. it clears the transport's
> binding under certain conditions if more than 60 seconds have elapsed
> since the connection was last bound.
> 
> This rebinding is not necessary for an autobind RPC client over a
> connection-oriented protocol like TCP.
> 
> It can also cause problems: it is possible for nlm_bind_host() to clear
> XPRT_BOUND whilst a connection worker is in the middle of trying to
> reconnect, after it had already been checked in xprt_connect().
> 
> When the connection worker notices that XPRT_BOUND has been cleared
> under it, in xs_tcp_finish_connecting(), that results in:
> 
> 	xs_tcp_setup_socket: connect returned unhandled error -107
> 
> Worse, it's possible that the two can get into lockstep, resulting in
> the same behaviour repeated indefinitely, with the above error every
> 300 seconds, without ever recovering, and the connection never being
> established. This has been seen in practice, with a large number of NLM
> client tasks, following a server restart.
> 
> The existing callers of nlm_bind_host & nlm_rebind_host should not need
> to force the rebind, for TCP, so restrict the interval-based rebinding
> to UDP only.
> 
> For TCP, we will still rebind when needed, e.g. on timeout, and connection
> error (including closure), since connection-related errors on an existing
> connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
> already unconditionally clear XPRT_BOUND.
> 
> To avoid having to add the fix, and explanation, to both nlm_bind_host()
> and nlm_rebind_host(), remove the duplicate code from the former, and
> have it call the latter.
> 
> Drop the dprintk, which adds no value over a trace.
> 
> Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
> ---
>   fs/lockd/host.c | 20 +++++++++++---------
>   1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> index 0afb6d59bad0..771c289f6df7 100644
> --- a/fs/lockd/host.c
> +++ b/fs/lockd/host.c
> @@ -439,12 +439,7 @@ nlm_bind_host(struct nlm_host *host)
>   	 * RPC rebind is required
>   	 */
>   	if ((clnt = host->h_rpcclnt) != NULL) {
> -		if (time_after_eq(jiffies, host->h_nextrebind)) {
> -			rpc_force_rebind(clnt);
> -			host->h_nextrebind = jiffies + NLM_HOST_REBIND;
> -			dprintk("lockd: next rebind in %lu jiffies\n",
> -					host->h_nextrebind - jiffies);
> -		}
> +		nlm_rebind_host(host);
>   	} else {
>   		unsigned long increment = nlmsvc_timeout;
>   		struct rpc_timeout timeparms = {
> @@ -494,13 +489,20 @@ nlm_bind_host(struct nlm_host *host)
>   	return clnt;
>   }
>   
> -/*
> - * Force a portmap lookup of the remote lockd port
> +/**
> + * nlm_rebind_host - If needed, force a portmap lookup of the peer's lockd port
> + * @host: NLM host handle for peer
> + *
> + * This is not needed when using a connection-oriented protocol, such as TCP.
> + * The existing autobind mechanism is sufficient to force a rebind when
> + * required, e.g. on connection state transitions.
>    */
>   void
>   nlm_rebind_host(struct nlm_host *host)
>   {
> -	dprintk("lockd: rebind host %s\n", host->h_name);
> +	if (host->h_proto != IPPROTO_UDP)
> +		return;
> +
>   	if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
>   		rpc_force_rebind(host->h_rpcclnt);
>   		host->h_nextrebind = jiffies + NLM_HOST_REBIND;
>

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 0afb6d59bad0..771c289f6df7 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -439,12 +439,7 @@  nlm_bind_host(struct nlm_host *host)
 	 * RPC rebind is required
 	 */
 	if ((clnt = host->h_rpcclnt) != NULL) {
-		if (time_after_eq(jiffies, host->h_nextrebind)) {
-			rpc_force_rebind(clnt);
-			host->h_nextrebind = jiffies + NLM_HOST_REBIND;
-			dprintk("lockd: next rebind in %lu jiffies\n",
-					host->h_nextrebind - jiffies);
-		}
+		nlm_rebind_host(host);
 	} else {
 		unsigned long increment = nlmsvc_timeout;
 		struct rpc_timeout timeparms = {
@@ -494,13 +489,20 @@  nlm_bind_host(struct nlm_host *host)
 	return clnt;
 }
 
-/*
- * Force a portmap lookup of the remote lockd port
+/**
+ * nlm_rebind_host - If needed, force a portmap lookup of the peer's lockd port
+ * @host: NLM host handle for peer
+ *
+ * This is not needed when using a connection-oriented protocol, such as TCP.
+ * The existing autobind mechanism is sufficient to force a rebind when
+ * required, e.g. on connection state transitions.
  */
 void
 nlm_rebind_host(struct nlm_host *host)
 {
-	dprintk("lockd: rebind host %s\n", host->h_name);
+	if (host->h_proto != IPPROTO_UDP)
+		return;
+
 	if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
 		rpc_force_rebind(host->h_rpcclnt);
 		host->h_nextrebind = jiffies + NLM_HOST_REBIND;

lockd: don't use interval-based rebinding over TCP

Commit Message

Comments

Patch