
[BUG] nfs3 client stops retrying to connect

Message ID 20150825151614.GA31127@bender.morinfr.org (mailing list archive)
State New, archived

Commit Message

Guillaume Morin Aug. 25, 2015, 3:16 p.m. UTC
On 08 Jun 20:12, Guillaume Morin wrote:
>
> On 08 Jun 13:50, Chuck Lever wrote:
> > The linger timer is started by FIN_WAIT1 or LAST_ACK, and
> > xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
> > XPRT_CONNECTION_ABORT.
> > 
> > At a guess there could be a race between xs_tcp_cancel_linger_timeout
> > and the connect worker clearing those flags.
> 
> The connect worker is xs_tcp_setup_socket().  It clears the connecting
> bit in all code paths.  So the only kind of race I can see here is
> another function cancelling it before it runs without clearing the bit.
> 
> xs_tcp_cancel_linger_timeout() does the right thing afaict.  It clears
> the bit if cancel_delayed_work() returns a non-zero value.
> 
> The only other place where the worker is cancelled is xs_close() but it
> does not clear the bit. So if it cancels the worker before it had
> started running, the bit will stay up.

FWIW I patched our production kernel a couple of months ago to clear the
connecting bit in xs_close(). Since then we've had a few NFS server
outages and the problem has never recurred, whereas before the change we
always had a few machines that could not reconnect.  I feel fairly
confident this was the bug.

I am posting the change in case it helps someone running one of the
stable kernels.

    sunrpc: call xprt_clear_connecting in xs_close
    
    This closes the race where the CONNECTING bit in the xprt
    is left set while the kernel is not trying to connect.



Another option would be to call clear_bit() a few lines later, but
clear_bit() is never used directly for CONNECTING, so I went with
xprt_clear_connecting().
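
For anyone trying to picture the race, here is a rough userspace sketch of
it.  This is not kernel code: the model_* names and the two bool fields are
made up for illustration, and it assumes the connect path tests and sets
the CONNECTING bit before queueing the worker, with the worker clearing the
bit when it runs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of struct rpc_xprt. */
struct model_xprt {
	bool connecting;	/* models the XPRT_CONNECTING bit */
	bool work_pending;	/* models a queued connect_worker */
};

/*
 * Models a connect request: refuse if the bit is already set,
 * otherwise set it and queue the worker.
 * Returns true if a connect attempt was queued.
 */
static bool model_connect(struct model_xprt *x)
{
	if (x->connecting)
		return false;	/* someone is supposedly connecting already */
	x->connecting = true;
	x->work_pending = true;
	return true;
}

/*
 * Models a close: the queued worker is cancelled before it ran.
 * With clear_connecting == false this is the unpatched behaviour,
 * with true it is the behaviour after the patch.
 */
static void model_close(struct model_xprt *x, bool clear_connecting)
{
	x->work_pending = false;	/* worker will never run */
	if (clear_connecting)
		x->connecting = false;
}

/*
 * Queue a connect, close before the worker runs, then try again.
 * Returns true if the second connect attempt could be queued.
 */
static bool model_reconnect_after_close(bool patched)
{
	struct model_xprt x = { false, false };

	model_connect(&x);
	model_close(&x, patched);
	return model_connect(&x);
}
```

In this model, model_reconnect_after_close(false) returns false because the
bit stays set with no worker left to clear it, while
model_reconnect_after_close(true) returns true.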

Patch

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 41c2f9d..1b71c59 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -891,6 +891,7 @@  static void xs_close(struct rpc_xprt *xprt)
 	dprintk("RPC:       xs_close xprt %p\n", xprt);
 
 	cancel_delayed_work_sync(&transport->connect_worker);
+	xprt_clear_connecting(xprt);
 
 	xs_reset_transport(transport);
 	xprt->reestablish_timeout = 0;