lockd: don't use timed rebind with TCP

Message ID: 20201002225750.16452-1-calum.mackay@oracle.com
State: New, archived
Return-Path: <SRS0=FADJ=DJ=vger.kernel.org=linux-nfs-owner@kernel.org>
From: Calum Mackay <calum.mackay@oracle.com>
To: trondmy@hammerspace.com, anna.schumaker@netapp.com
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH] lockd: don't use timed rebind with TCP
Date: Fri, 2 Oct 2020 23:57:50 +0100
Precedence: bulk
Series: lockd: don't use timed rebind with TCP

Commit Message

Calum Mackay Oct. 2, 2020, 10:57 p.m. UTC

It is possible for nlm_bind_host() to clear XPRT_BOUND whilst a connection
worker is in the middle of trying to reconnect. When the latter notices
that XPRT_BOUND has been cleared under it, in xs_tcp_finish_connecting(),
that results in:

	xs_tcp_setup_socket: connect returned unhandled error -107

Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This is most likely to occur when there's a large number
of NLM client tasks following a server reboot.

Since the timed rebind would seem not to be needed for TCP in any case,
whilst the existing connection remains, restrict the timed rebinding to
UDP only.

For TCP, we will still rebind when needed, e.g. on timeout, connection
error (including closure), and in the reclaimer.

Whilst there, refactor some duplicate code.

Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
---
 fs/lockd/host.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

Comments

Calum Mackay Oct. 3, 2020, 5:21 p.m. UTC | #1

Please hold off for now on this one; I think I need to adjust the 
reclaimer a little.

thanks,
calum.

Patch

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 0afb6d59bad0..6e98c2ed6ffc 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -439,12 +439,7 @@  nlm_bind_host(struct nlm_host *host)
 	 * RPC rebind is required
 	 */
 	if ((clnt = host->h_rpcclnt) != NULL) {
-		if (time_after_eq(jiffies, host->h_nextrebind)) {
-			rpc_force_rebind(clnt);
-			host->h_nextrebind = jiffies + NLM_HOST_REBIND;
-			dprintk("lockd: next rebind in %lu jiffies\n",
-					host->h_nextrebind - jiffies);
-		}
+		nlm_rebind_host(host);
 	} else {
 		unsigned long increment = nlmsvc_timeout;
 		struct rpc_timeout timeparms = {
@@ -495,15 +490,18 @@  nlm_bind_host(struct nlm_host *host)
 }
 
 /*
- * Force a portmap lookup of the remote lockd port
+ * Force a portmap lookup of the remote lockd port, unless we're using a
+ * TCP connection.
  */
 void
 nlm_rebind_host(struct nlm_host *host)
 {
-	dprintk("lockd: rebind host %s\n", host->h_name);
-	if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
+	if (unlikely(host->h_proto == IPPROTO_UDP) && host->h_rpcclnt &&
+			time_after_eq(jiffies, host->h_nextrebind)) {
 		rpc_force_rebind(host->h_rpcclnt);
 		host->h_nextrebind = jiffies + NLM_HOST_REBIND;
+		dprintk("lockd: rebind host %s; next rebind in %lu jiffies\n",
+			host->h_name, host->h_nextrebind - jiffies);
 	}
 }
