Message ID | 20200623152409.70257-1-olga.kornievskaia@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] SUNRPC dont update timeout value on connection reset | expand |
Trond/Anna, Any comments on this patch? On Tue, Jun 23, 2020 at 11:22 AM Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > Current behaviour: every time a v3 operation is re-sent to the server > we update (double) the timeout. There is no distinction between whether > or not the previous timer had expired before the re-sent happened. > > Here's the scenario: > 1. Client sends a v3 operation > 2. Server RST-s the connection (prior to the timeout) (eg., connection > is immediately reset) > 3. Client re-sends a v3 operation but the timeout is now 120sec. > > As a result, an application sees 2mins pause before a retry in case > server again does not reply. Where as if a connection reset didn't > change the timeout value, the client would have re-tried (the 3rd > time) after 60secs. > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > > --- I have a question with regards to should we also not update the > number of retries when connection is RST-ed? This would allow the > client to still weather a 6mins (60+120+180) of unresponsive server. > After this patch the client can handle only 3mins (60+120) of > unresponsive server after the initial RST --- > --- > net/sunrpc/clnt.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > index a91d1cd..65517cf 100644 > --- a/net/sunrpc/clnt.c > +++ b/net/sunrpc/clnt.c > @@ -2405,7 +2405,8 @@ void rpc_force_rebind(struct rpc_clnt *clnt) > goto out_exit; > } > task->tk_action = call_encode; > - rpc_check_timeout(task); > + if (status != -ECONNRESET && status != -ECONNABORTED) > + rpc_check_timeout(task); > return; > out_exit: > rpc_call_rpcerror(task, status); > -- > 1.8.3.1 >
On Sun, 2020-06-28 at 14:03 -0400, Olga Kornievskaia wrote: > Trond/Anna, > > Any comments on this patch? > > On Tue, Jun 23, 2020 at 11:22 AM Olga Kornievskaia > <olga.kornievskaia@gmail.com> wrote: > > Current behaviour: every time a v3 operation is re-sent to the > > server > > we update (double) the timeout. There is no distinction between > > whether > > or not the previous timer had expired before the re-sent happened. > > > > Here's the scenario: > > 1. Client sends a v3 operation > > 2. Server RST-s the connection (prior to the timeout) (eg., > > connection > > is immediately reset) > > 3. Client re-sends a v3 operation but the timeout is now 120sec. Ah... The problem here is clearly '3.' incrementing the timeout value before we've actually hit a minor or major timeout... So I think we want to look carefully at xprt_adjust_timeout(). The first rule there should be that if we're below the threshold for a minor timeout, we just want to exit without changing anything. The second rule is then that if we're below the threshold for a major timeout, then we adjust the timeout value by doubling it (if to- >to_exponential) or adding the value to->to_increment (if !to- >to_exponential) and then exit. Finally, if this is a major timeout, we reset req->rq_timeout to to- >to_initval, reset req->rq_retries, call xprt_reset_majortimeo(), reset the RTT counters and return ETIMEDOUT. None of this should be specific to your connection reset case. This is how we want timeouts to work in the generic case, so we need to fix that.
On Sun, Jun 28, 2020 at 5:16 PM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Sun, 2020-06-28 at 14:03 -0400, Olga Kornievskaia wrote: > > Trond/Anna, > > > > Any comments on this patch? > > > > On Tue, Jun 23, 2020 at 11:22 AM Olga Kornievskaia > > <olga.kornievskaia@gmail.com> wrote: > > > Current behaviour: every time a v3 operation is re-sent to the > > > server > > > we update (double) the timeout. There is no distinction between > > > whether > > > or not the previous timer had expired before the re-sent happened. > > > > > > Here's the scenario: > > > 1. Client sends a v3 operation > > > 2. Server RST-s the connection (prior to the timeout) (eg., > > > connection > > > is immediately reset) > > > 3. Client re-sends a v3 operation but the timeout is now 120sec. > > Ah... The problem here is clearly '3.' incrementing the timeout value > before we've actually hit a minor or major timeout... > > So I think we want to look carefully at xprt_adjust_timeout(). The > first rule there should be that if we're below the threshold for a > minor timeout, we just want to exit without changing anything. > > The second rule is then that if we're below the threshold for a major > timeout, then we adjust the timeout value by doubling it (if to- > >to_exponential) or adding the value to->to_increment (if !to- > >to_exponential) and then exit. > > Finally, if this is a major timeout, we reset req->rq_timeout to to- > >to_initval, reset req->rq_retries, call xprt_reset_majortimeo(), reset > the RTT counters and return ETIMEDOUT. > > None of this should be specific to your connection reset case. This is > how we want timeouts to work in the generic case, so we need to fix > that. > Ok thanks for comments. I don't know if I got it right but I submitted a new version. > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > >
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index a91d1cd..65517cf 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -2405,7 +2405,8 @@ void rpc_force_rebind(struct rpc_clnt *clnt) goto out_exit; } task->tk_action = call_encode; - rpc_check_timeout(task); + if (status != -ECONNRESET && status != -ECONNABORTED) + rpc_check_timeout(task); return; out_exit: rpc_call_rpcerror(task, status);
Current behaviour: every time a v3 operation is re-sent to the server we update (double) the timeout. There is no distinction between whether or not the previous timer had expired before the re-sent happened. Here's the scenario: 1. Client sends a v3 operation 2. Server RST-s the connection (prior to the timeout) (eg., connection is immediately reset) 3. Client re-sends a v3 operation but the timeout is now 120sec. As a result, an application sees 2mins pause before a retry in case server again does not reply. Where as if a connection reset didn't change the timeout value, the client would have re-tried (the 3rd time) after 60secs. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> --- I have a question with regards to should we also not update the number of retries when connection is RST-ed? This would allow the client to still weather a 6mins (60+120+180) of unresponsive server. After this patch the client can handle only 3mins (60+120) of unresponsive server after the initial RST --- --- net/sunrpc/clnt.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)