Message ID | 20210221182700.1494-1-timo@rothenpieler.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | nfsd: set RPC_CLNT_CREATE_NO_IDLE_TIMEOUT on callback client | expand |
Thanks, Timo. A handful of nits below: > On Feb 21, 2021, at 1:27 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote: > > This tackles an issue where the callback client times out from > inactivity, causing operations like server side copy to never return on > the client side. > I was observing that issue frequently on my RDMA connected clients, it > does not seem to manifest on tcp connected clients. Indeed, it is curious that the COPY issue does not occur on TCP connections. You could try using the same tracing technique to collect some data on TCP to see what is different. > However, it does not fix the actual issue of the callback channel > not getting re-established and the client being stuck in the call > forever. It just makes it a lot less likely to occur, as long as no > other circumstances cause the callback channel to be disconnected. Agreed. I'm hoping Olga or Dai will look further into why recovery is failing in this case (and whether that missing recovery action is also observed only on RDMA transports!). Please add a Signed-off-by: tag. See the "Sign your work" section in Documentation/process/submitting-patches.rst > --- > fs/nfsd/nfs4callback.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index 052be5bf9ef5..75dacb7878b8 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -897,7 +897,7 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c > .timeout = &timeparms, > .program = &cb_program, > .version = 1, > - .flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET), > + .flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET | RPC_CLNT_CREATE_NO_IDLE_TIMEOUT), Kernel coding style keeps lines at 80 characters or fewer. Please find a way to keep the replacement line under 80 characters. > .cred = current_cred(), > }; > struct rpc_clnt *client; > -- > 2.25.1 > Once you have received other review comments, you might wish to submit this patch again. Be sure to update the Subject: line to say "[PATCH v2]". -- Chuck Lever
> On Feb 22, 2021, at 4:03 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote: > > On 21.02.2021 20:26, Chuck Lever wrote: >> Thanks, Timo. A handful of nits below: >>> On Feb 21, 2021, at 1:27 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote: >>> >>> This tackles an issue where the callback client times out from >>> inactivity, causing operations like server side copy to never return on >>> the client side. >>> I was observing that issue frequently on my RDMA connected clients, it >>> does not seem to manifest on tcp connected clients. >> Indeed, it is curious that the COPY issue does not occur on TCP >> connections. You could try using the same tracing technique to >> collect some data on TCP to see what is different. > > I did pretty much the same procedure, mount, copy, wait 10 minutes, copy again, just with it mounted via TCP instead of RDMA. > (And I made sure to boot the server on an unpatched kernel). > Worked perfectly fine. > The pcap and trace are attached. > > To me it looks like it just never hits xprt_disconnect_auto, despite the no timeout flag not being set. > Maybe it gets set implicitly somewhere for TCP? Or there is some keep alive going on for TCP, that never gives it a chance to time out. net/sunrpc/xprtsock.c: 2999 /* backchannel */ 3000 xprt_set_bound(xprt); 3001 xprt->bind_timeout = 0; 3002 xprt->reestablish_timeout = 0; 3003 xprt->idle_timeout = 0; But no similar setting for net/sunrpc/xprtrdma/svc_rdma_backchannel.c: 253 xprt_set_bound(xprt); 254 xprt_set_connected(xprt); 255 xprt->bind_timeout = RPCRDMA_BIND_TO; 256 xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO; 257 xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO; And perhaps these have been wrong since 5d252f90a800 ("svcrdma: Add class for RDMA backwards direction transport"). Try replacing your previous patch with one that changes all these backchannel settings to 0. It should have the same effect as using NO_IDLE_TIMEOUT. >>> However, it does not fix the actual issue of the callback channel >>> not getting re-established and the client being stuck in the call >>> forever. It just makes it a lot less likely to occur, as long as no >>> other circumstances cause the callback channel to be disconnected. >> Agreed. I'm hoping Olga or Dai will look further into why recovery >> is failing in this case (and whether that missing recovery action >> is also observed only on RDMA transports!). >> Please add a Signed-off-by: tag. See the "Sign your work" section >> in Documentation/process/submitting-patches.rst > > <trace.dat.xz><traffic.pcap.xz> -- Chuck Lever
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c index 052be5bf9ef5..75dacb7878b8 100644 --- a/fs/nfsd/nfs4callback.c +++ b/fs/nfsd/nfs4callback.c @@ -897,7 +897,7 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c .timeout = &timeparms, .program = &cb_program, .version = 1, - .flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET), + .flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET | RPC_CLNT_CREATE_NO_IDLE_TIMEOUT), .cred = current_cred(), }; struct rpc_clnt *client;