Message ID | 20190220145650.21566-1-olga.kornievskaia@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [1/1] SUNRPC: fix handling of half-closed connection
Hi Olga, Do you have a reproducer for this? A number of months ago I did a significant amount of testing with half-closed connections, after we had reports of connections stuck in FIN_WAIT2 in some older kernels. What I found was that with kernels that had the tcp keepalives (commit 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a hang of a few minutes, after which time the tcp keepalive code would reset the connection. That said, it was a while ago and something subtle may have changed. Also I'm not sure if your header implies an indefinite hang or just a few minutes. Thanks. On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > From: Olga Kornievskaia <kolga@netapp.com> > > When server replies with an ACK to client's FIN/ACK, client ends > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > Instead, make sure to close and reset client's socket and transport > when transitioned into that state. > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > --- > net/sunrpc/xprtsock.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > index 618e9c2..812e5e3 100644 > --- a/net/sunrpc/xprtsock.c > +++ b/net/sunrpc/xprtsock.c > @@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock > *sk) > clear_bit(XPRT_CLOSE_WAIT, &xprt->state); > smp_mb__after_atomic(); > break; > + case TCP_FIN_WAIT2: > case TCP_CLOSE_WAIT: > /* The server initiated a shutdown of the socket */ > xprt->connect_cookie++; > @@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt > *xprt) > kernel_sock_shutdown(sock, SHUT_RDWR); > trace_rpc_socket_shutdown(xprt, sock); > break; > + case TCP_FIN_WAIT2: > case TCP_CLOSE: > case TCP_TIME_WAIT: > xs_reset_transport(transport);
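For context, the keepalive mechanism described above comes down to standard TCP socket options. Below is a minimal userspace sketch with made-up timing values; the kernel RPC client arms the in-kernel equivalents of these options, so this is an illustration of the mechanism, not the xprtsock code itself:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Sketch: arm TCP keepalives so a dead or half-closed peer is
     * probed and the connection eventually reset. */
    static int arm_keepalive(int fd)
    {
            int on = 1, idle = 60, intvl = 60, cnt = 5;

            if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)))
                    return -1;
            /* first probe after 60s idle, then every 60s, give up after 5 */
            if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)))
                    return -1;
            if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)))
                    return -1;
            return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    }

If enough probes go unanswered, the stack aborts the connection and errors out the socket, which is the few-minutes reset described above.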
On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > Hi Olga, > > Do you have a reproducer for this? A number of months ago I did a > significant amount of testing with half-closed connections, after we > had reports of connections stuck in FIN_WAIT2 in some older kernels. > What I found was with kernels that had the tcp keepalives (commit > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a > hang of a few minutes, after which time the tcp keepalive code would > reset the connection. > > That said it was a while ago and something subtle may have changed. > Also I'm not not sure if your header implies an indefinite hang or > just > a few minutes. > > Thanks. > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > From: Olga Kornievskaia <kolga@netapp.com> > > > > When server replies with an ACK to client's FIN/ACK, client ends > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > Instead, make sure to close and reset client's socket and transport > > when transitioned into that state. So, please do note that we do not want to ignore the FIN_WAIT2 state because it implies that the server has not closed the socket on its side. That again means that we cannot re-establish a connection using the same source IP+port to the server, which is problematic for protocols such as NFSv3 which rely on standard duplicate reply cache for correct replay semantics. This is why we don't just set the TCP_LINGER2 socket option and call sock_release(). The choice to try to wait it out is deliberate because the alternative is that we end up with busy-waiting re-connection attempts. Cheers Trond
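To illustrate the port-reuse constraint: an RPC client that wants NFSv3 replays to hit the server's duplicate reply cache must reconnect from the same source port, conceptually as in the hedged sketch below (the function name and fixed port are hypothetical). While the server-side socket from the old connection still exists, a connect() from the identical source IP+port 4-tuple cannot succeed:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Sketch: bind to a fixed source port before reconnecting, as the
     * RPC client does so NFSv3 replays match the server's DRC. */
    static int reconnect_same_port(const struct sockaddr_in *srv,
                                   unsigned short port)
    {
            struct sockaddr_in src;
            int on = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            if (fd < 0)
                    return -1;
            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
            memset(&src, 0, sizeof(src));
            src.sin_family = AF_INET;
            src.sin_port = htons(port);  /* same source port as before */
            if (bind(fd, (struct sockaddr *)&src, sizeof(src)) ||
                connect(fd, (const struct sockaddr *)srv, sizeof(*srv))) {
                    close(fd);
                    return -1;  /* fails while the old association lingers */
            }
            return fd;
    }

This is why waiting out the server's FIN, rather than abandoning the port, is the deliberate choice described above.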
Hi Dave, A reproducer is a server that sends an ACK to the client's FIN/ACK and does nothing afterwards (I can reproduce it 100% with a hacked-up server; it was discovered with a "broken" server that doesn't fully close a connection). It leaves the client unable to connect to this server again indefinitely, forever, a reboot-required kind of state. At one point doing something like that to the client was considered a form of attack (denial of service), and thus the kernel has a tcp_fin_timeout option after which the kernel will abort the connection. However, this only applies to sockets that have been closed by the client. That is NOT the case here: NFS does not close the connection, and it ignores the kernel's notification of the FIN_WAIT2 state. One can argue that this is a broken server and we shouldn't bother. But this patch is an attempt to argue that the client should still care and deal with this condition. However, if the community feels that a broken server is a broken server and this form of attack is not interesting, this patch can live in the archive for later, or never. On Fri, Feb 22, 2019 at 7:12 AM Dave Wysochanski <dwysocha@redhat.com> wrote: > > Hi Olga, > > Do you have a reproducer for this? A number of months ago I did a > significant amount of testing with half-closed connections, after we > had reports of connections stuck in FIN_WAIT2 in some older kernels. > What I found was with kernels that had the tcp keepalives (commit > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a > hang of a few minutes, after which time the tcp keepalive code would > reset the connection. > > That said it was a while ago and something subtle may have changed. > Also I'm not not sure if your header implies an indefinite hang or just > a few minutes. > > Thanks. > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > From: Olga Kornievskaia <kolga@netapp.com> > > > > When server replies with an ACK to client's FIN/ACK, client ends > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > Instead, make sure to close and reset client's socket and transport > > when transitioned into that state. > > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > > --- > > net/sunrpc/xprtsock.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > index 618e9c2..812e5e3 100644 > > --- a/net/sunrpc/xprtsock.c > > +++ b/net/sunrpc/xprtsock.c > > @@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock > > *sk) > > clear_bit(XPRT_CLOSE_WAIT, &xprt->state); > > smp_mb__after_atomic(); > > break; > > + case TCP_FIN_WAIT2: > > case TCP_CLOSE_WAIT: > > /* The server initiated a shutdown of the socket */ > > xprt->connect_cookie++; > > @@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt > > *xprt) > > kernel_sock_shutdown(sock, SHUT_RDWR); > > trace_rpc_socket_shutdown(xprt, sock); > > break; > > + case TCP_FIN_WAIT2: > > case TCP_CLOSE: > > case TCP_TIME_WAIT: > > xs_reset_transport(transport);
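A hedged sketch of the sort of reproducer described above (an illustration, not the actual hacked-up server): a TCP server that accepts a connection and then never calls close(). The kernel ACKs the client's FIN on its own, so the client parks in FIN_WAIT2 while this side sits in CLOSE_WAIT indefinitely:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* "Broken server": accept, then never close(), never send a FIN. */
    int main(void)
    {
            struct sockaddr_in addr;
            int lfd = socket(AF_INET, SOCK_STREAM, 0);

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(20490);   /* arbitrary test port */
            bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
            listen(lfd, 1);
            accept(lfd, NULL, NULL);        /* leak the accepted socket */
            for (;;)
                    pause();  /* the stack ACKs the peer's FIN by itself */
    }

Pointing a client at something like this and letting it close its half of the connection reproduces the stuck FIN_WAIT2 state.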
On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > Hi Olga, > > > > Do you have a reproducer for this? A number of months ago I did a > > significant amount of testing with half-closed connections, after we > > had reports of connections stuck in FIN_WAIT2 in some older kernels. > > What I found was with kernels that had the tcp keepalives (commit > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a > > hang of a few minutes, after which time the tcp keepalive code would > > reset the connection. > > > > That said it was a while ago and something subtle may have changed. > > Also I'm not not sure if your header implies an indefinite hang or > > just > > a few minutes. > > > > Thanks. > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > When server replies with an ACK to client's FIN/ACK, client ends > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > > Instead, make sure to close and reset client's socket and transport > > > when transitioned into that state. > Hi Trond, > So, please do note that we do not want to ignore the FIN_WAIT2 state But we do ignore the FIN_WAIT2 state. > because it implies that the server has not closed the socket on its > side. That's correct. > That again means that we cannot re-establish a connection using > the same source IP+port to the server, which is problematic for > protocols such as NFSv3 which rely on standard duplicate reply cache > for correct replay semantics. That's exactly what's happening: the client is unable to establish a new connection to the server. With the patch, the client sends an RST, re-uses the port, and all is well for NFSv3. > This is why we don't just set the TCP_LINGER2 socket option and call > sock_release(). The choice to try to wait it out is deliberate because > the alternative is that we end up with busy-waiting re-connection > attempts. Why would it busy-wait? In my testing, an RST happens and a new connection is established. > > Cheers > Trond > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > >
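For what it's worth, the TCP-level effect being debated here is the difference between an orderly FIN handshake and an abortive close. In userspace, the classic way to force an immediate RST is a zero-linger close; a hedged sketch follows (the patch reaches its reset through the sunrpc transport teardown, not this exact call sequence):

    #include <sys/socket.h>
    #include <unistd.h>

    /* Sketch: abortive close.  With SO_LINGER on and l_linger == 0,
     * close() discards unsent data and emits an RST instead of a FIN,
     * so the socket never passes through FIN_WAIT1/FIN_WAIT2. */
    static void abortive_close(int fd)
    {
            struct linger lg = { .l_onoff = 1, .l_linger = 0 };

            setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
            close(fd);
    }

The objection that follows is precisely that an RST produced this way carries no delivery guarantee.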
On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote: > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust < > trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > > Hi Olga, > > > > > > Do you have a reproducer for this? A number of months ago I did > > > a > > > significant amount of testing with half-closed connections, after > > > we > > > had reports of connections stuck in FIN_WAIT2 in some older > > > kernels. > > > What I found was with kernels that had the tcp keepalives (commit > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce > > > a > > > hang of a few minutes, after which time the tcp keepalive code > > > would > > > reset the connection. > > > > > > That said it was a while ago and something subtle may have > > > changed. > > > Also I'm not not sure if your header implies an indefinite hang > > > or > > > just > > > a few minutes. > > > > > > Thanks. > > > > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > > > When server replies with an ACK to client's FIN/ACK, client > > > > ends > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > > > Instead, make sure to close and reset client's socket and > > > > transport > > > > when transitioned into that state. > > Hi Trond, > > > So, please do note that we do not want to ignore the FIN_WAIT2 > > state > > But we do ignore the FIN_WAIT2 state. We do not. We wait for the server to send a FIN, which is precisely the reason for which FIN_WAIT2 exists. > > > because it implies that the server has not closed the socket on its > > side. > > That's correct. > > > That again means that we cannot re-establish a connection using > > the same source IP+port to the server, which is problematic for > > protocols such as NFSv3 which rely on standard duplicate reply > > cache > > for correct replay semantics. > > that's exactly what's happening that a client is unable to establish > a > new connection to the server. With the patch, the client does an RST > and it re-uses the port and all is well for NFSv3. RST is not guaranteed to be delivered to the recipient. That's why the TCP protocol defines FIN: it is guaranteed to be delivered because it is ACKed. > > This is why we don't just set the TCP_LINGER2 socket option and > > call > > sock_release(). The choice to try to wait it out is deliberate > > because > > the alternative is that we end up with busy-waiting re-connection > > attempts. > > Why would it busy-wait? In my testing, RST happens and new connection > is established? Only if the server has dropped the connection without notifying the client.
On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote: > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust < > > trondmy@hammerspace.com> wrote: > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > > > Hi Olga, > > > > > > > > Do you have a reproducer for this? A number of months ago I did > > > > a > > > > significant amount of testing with half-closed connections, after > > > > we > > > > had reports of connections stuck in FIN_WAIT2 in some older > > > > kernels. > > > > What I found was with kernels that had the tcp keepalives (commit > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce > > > > a > > > > hang of a few minutes, after which time the tcp keepalive code > > > > would > > > > reset the connection. > > > > > > > > That said it was a while ago and something subtle may have > > > > changed. > > > > Also I'm not not sure if your header implies an indefinite hang > > > > or > > > > just > > > > a few minutes. > > > > > > > > Thanks. > > > > > > > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > > > > > When server replies with an ACK to client's FIN/ACK, client > > > > > ends > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > > > > Instead, make sure to close and reset client's socket and > > > > > transport > > > > > when transitioned into that state. > > > > Hi Trond, > > > > > So, please do note that we do not want to ignore the FIN_WAIT2 > > > state > > > > But we do ignore the FIN_WAIT2 state. > > We do not. We wait for the server to send a FIN, which is precisely the > reason for which FIN_WAIT2 exists. > > > > > > because it implies that the server has not closed the socket on its > > > side. > > > > That's correct. > > > > > That again means that we cannot re-establish a connection using > > > the same source IP+port to the server, which is problematic for > > > protocols such as NFSv3 which rely on standard duplicate reply > > > cache > > > for correct replay semantics. > > > > that's exactly what's happening that a client is unable to establish > > a > > new connection to the server. With the patch, the client does an RST > > and it re-uses the port and all is well for NFSv3. > > RST is not guaranteed to be delivered to the recipient. That's why the > TCP protocol defines FIN: it is a guaranteed to be delivered because it > is ACKed. > > > > This is why we don't just set the TCP_LINGER2 socket option and > > > call > > > sock_release(). The choice to try to wait it out is deliberate > > > because > > > the alternative is that we end up with busy-waiting re-connection > > > attempts. > > > > Why would it busy-wait? In my testing, RST happens and new connection > > is established? > > Only if the server has dropped the connection without notifying the > client. Yes the server dropped the connection without notifying the client (or perhaps something in the middle did it as an attack). Again, I raise this concern for the sake of dealing with this as an attack. I have no intentions of catering to broken servers. If this is not a possible attack, then we don't have to deal with it.
On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote: > On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust > <trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote: > > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust < > > > trondmy@hammerspace.com> wrote: > > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > > > > Hi Olga, > > > > > > > > > > Do you have a reproducer for this? A number of months ago I > > > > > did > > > > > a > > > > > significant amount of testing with half-closed connections, > > > > > after > > > > > we > > > > > had reports of connections stuck in FIN_WAIT2 in some older > > > > > kernels. > > > > > What I found was with kernels that had the tcp keepalives > > > > > (commit > > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only > > > > > reproduce > > > > > a > > > > > hang of a few minutes, after which time the tcp keepalive > > > > > code > > > > > would > > > > > reset the connection. > > > > > > > > > > That said it was a while ago and something subtle may have > > > > > changed. > > > > > Also I'm not not sure if your header implies an indefinite > > > > > hang > > > > > or > > > > > just > > > > > a few minutes. > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > > > > > > > When server replies with an ACK to client's FIN/ACK, client > > > > > > ends > > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > > > > > Instead, make sure to close and reset client's socket and > > > > > > transport > > > > > > when transitioned into that state. > > > > > > Hi Trond, > > > > > > > So, please do note that we do not want to ignore the FIN_WAIT2 > > > > state > > > > > > But we do ignore the FIN_WAIT2 state. > > > > We do not. We wait for the server to send a FIN, which is precisely > > the > > reason for which FIN_WAIT2 exists. > > > > > > because it implies that the server has not closed the socket on > > > > its > > > > side. > > > > > > That's correct. > > > > > > > That again means that we cannot re-establish a connection using > > > > the same source IP+port to the server, which is problematic for > > > > protocols such as NFSv3 which rely on standard duplicate reply > > > > cache > > > > for correct replay semantics. > > > > > > that's exactly what's happening that a client is unable to > > > establish > > > a > > > new connection to the server. With the patch, the client does an > > > RST > > > and it re-uses the port and all is well for NFSv3. > > > > RST is not guaranteed to be delivered to the recipient. That's why > > the > > TCP protocol defines FIN: it is a guaranteed to be delivered > > because it > > is ACKed. > > > > > > This is why we don't just set the TCP_LINGER2 socket option and > > > > call > > > > sock_release(). The choice to try to wait it out is deliberate > > > > because > > > > the alternative is that we end up with busy-waiting re- > > > > connection > > > > attempts. > > > > > > Why would it busy-wait? In my testing, RST happens and new > > > connection > > > is established? > > > > Only if the server has dropped the connection without notifying the > > client. > > Yes the server dropped the connection without notifying the client > (or > perhaps something in the middle did it as an attack). Again, I raise > this concern for the sake of dealing with this as an attack. 
> I have > no > intentions of catering to broken servers. If this is not a possible > attack, then we don't have to deal with it. A man in the middle might be able to intercept the FIN from the server and ACK it, causing the connection to be closed on that server. However, as Dave pointed out, why wouldn't the keepalive mechanism then eventually kick in and close the socket on the client as well? If the FIN is not ACKed, then the server is supposed to keep retransmitting it. Until that ACK is received, it cannot close the socket without violating the TCP protocol.
On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote: > Hi Dave, > > A re-producer is a server that sends an ACK to the client's FIN/ACK > request and does nothing afterwards (I can reproduce it 100% with a > hacked up server. It was discovered with a "broken" server that > doesn't fully closes a connection). It leave this client unable to > connect to this server again indefinitely/forever/reboot required > kind > of state. Once it was considered that doing something like that to > the > client is a form of an attack (denial-of-server) and thus the kernel > has a tcp_fin_timeout option after which the kernel will abort the > connection. However this only applies to the sockets that have been > closed by the client. This is NOT the case. NFS does not close the > connection and it ignores kernel's notification of FIN_WAIT2 state. > Interesting. I had the same reproducer but eventually an RST would get sent from the NFS client due to the TCP keepalives. It sounds like that is not the case anymore. My testing was over a year and a half ago, though. > One can argue that this is a broken server and we shouldn't bother. > But this patch is an attempt to argue that the client still should > care and deal with this condition. However, if the community feels > that a broken server is a broken server and this form of an attack is > not interested, this patch can live will be an archive for later or > never. > This isn't IPoIB, is it? Actually, fwiw, looking back I had speculated on changes in this area. I'm adding you to the CC list of this bug which had some of my musings on it: https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43 That bug I ended up closing when we could no longer prove there was any state where the NFS client could get stuck in FIN_WAIT2 after the keepalive patch. It can happen that the server only sends the ACK back to the client's FIN,ACK, so in general the testcase is valid. But then the question is how long should one wait for the final data and FIN from the server, or are there ever instances where you shouldn't wait forever? Is there a way for us to know for sure there is no data left to receive so it's safe to time out? No RPCs outstanding? I don't claim to know many of the subtleties here, as far as whether the server would wait forever in LAST_ACK or whether implementations eventually time out after some time. Seems like if these last packets get lost there is likely a bug somewhere (either firewall or TCP stack, etc). https://tools.ietf.org/html/rfc793#page-22 It looks like at least some people are putting timeouts into their stacks, though I'm not sure whether that's a good idea or not.
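On the "is there a way to know for sure there is no data left to receive" question: one approximation available to a userspace client is the SIOCINQ ioctl, which reports bytes queued but unread on a TCP socket. This is only a point-in-time answer (more data could still be in flight), and the in-kernel RPC client would have to consult its own list of outstanding requests instead; a hedged sketch:

    #include <linux/sockios.h>      /* SIOCINQ */
    #include <sys/ioctl.h>

    /* Sketch: returns the number of unread bytes queued on the socket,
     * or -1 on error.  Zero does not prove the peer is done sending. */
    static int bytes_pending(int fd)
    {
            int n;

            if (ioctl(fd, SIOCINQ, &n) < 0)
                    return -1;
            return n;
    }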
On Fri, Feb 22, 2019 at 10:50 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote: > > On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote: > > > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust < > > > > trondmy@hammerspace.com> wrote: > > > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > > > > > Hi Olga, > > > > > > > > > > > > Do you have a reproducer for this? A number of months ago I > > > > > > did > > > > > > a > > > > > > significant amount of testing with half-closed connections, > > > > > > after > > > > > > we > > > > > > had reports of connections stuck in FIN_WAIT2 in some older > > > > > > kernels. > > > > > > What I found was with kernels that had the tcp keepalives > > > > > > (commit > > > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only > > > > > > reproduce > > > > > > a > > > > > > hang of a few minutes, after which time the tcp keepalive > > > > > > code > > > > > > would > > > > > > reset the connection. > > > > > > > > > > > > That said it was a while ago and something subtle may have > > > > > > changed. > > > > > > Also I'm not not sure if your header implies an indefinite > > > > > > hang > > > > > > or > > > > > > just > > > > > > a few minutes. > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote: > > > > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > > > > > > > > > When server replies with an ACK to client's FIN/ACK, client > > > > > > > ends > > > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs. > > > > > > > Instead, make sure to close and reset client's socket and > > > > > > > transport > > > > > > > when transitioned into that state. > > > > > > > > Hi Trond, > > > > > > > > > So, please do note that we do not want to ignore the FIN_WAIT2 > > > > > state > > > > > > > > But we do ignore the FIN_WAIT2 state. > > > > > > We do not. We wait for the server to send a FIN, which is precisely > > > the > > > reason for which FIN_WAIT2 exists. > > > > > > > > because it implies that the server has not closed the socket on > > > > > its > > > > > side. > > > > > > > > That's correct. > > > > > > > > > That again means that we cannot re-establish a connection using > > > > > the same source IP+port to the server, which is problematic for > > > > > protocols such as NFSv3 which rely on standard duplicate reply > > > > > cache > > > > > for correct replay semantics. > > > > > > > > that's exactly what's happening that a client is unable to > > > > establish > > > > a > > > > new connection to the server. With the patch, the client does an > > > > RST > > > > and it re-uses the port and all is well for NFSv3. > > > > > > RST is not guaranteed to be delivered to the recipient. That's why > > > the > > > TCP protocol defines FIN: it is a guaranteed to be delivered > > > because it > > > is ACKed. > > > > > > > > This is why we don't just set the TCP_LINGER2 socket option and > > > > > call > > > > > sock_release(). The choice to try to wait it out is deliberate > > > > > because > > > > > the alternative is that we end up with busy-waiting re- > > > > > connection > > > > > attempts. > > > > > > > > Why would it busy-wait? In my testing, RST happens and new > > > > connection > > > > is established? 
> > > > > > Only if the server has dropped the connection without notifying the > > > client. > > > > Yes the server dropped the connection without notifying the client > > (or > > perhaps something in the middle did it as an attack). Again, I raise > > this concern for the sake of dealing with this as an attack. I have > > no > > intentions of catering to broken servers. If this is not a possible > > attack, then we don't have to deal with it. > > A man in the middle might be able to intercept the FIN from the server > and ACK it, causing the connection to be closed on that server. > However, as Dave pointed out, why wouldn't the keepalive mechanism then > eventually kick in and close the socket on the client as well?

The mechanism already kicked in and got stuck in FIN_WAIT2. The NFS connection was idle, so the TCP layer was sending keep-alives. Then it sent a FIN/ACK, to which the server replied with just an ACK. The kernel notified NFS that we are in FIN_WAIT2, and I believe it is NFS's responsibility to act accordingly. The kernel then keeps sending "keep-alives" forever, because of this code:

case TCP_FIN_WAIT1:
case TCP_FIN_WAIT2:
        /* RFC 793 says to queue data in these states,
         * RFC 1122 says we MUST send a reset.
         * BSD 4.4 also does reset.
         */
        if (sk->sk_shutdown & RCV_SHUTDOWN) {
                if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
                    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
                        tcp_reset(sk); << this is never triggered
                        return 1;
                }
        }

In our case TCP_SKB_CB(skb)->end_seq always equals TCP_SKB_CB(skb)->seq. (No, I don't know the meaning of end_seq and seq :-/). > If the FIN is not ACKed, then the server is supposed to keep > retransmitting it. Until that ACK is received, it cannot close the > socket without violating the TCP protocol. Something in the middle can keep intercepting the FIN/ACK from the server and keep sending an ACK back? > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > >
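To fill in the open question about those fields: in the kernel's TCP code, TCP_SKB_CB(skb)->seq is the sequence number of the first byte the segment covers, and end_seq is seq plus the payload length, plus one each for the SYN and FIN flags. A pure ACK, such as the reply to a keepalive probe, carries no payload and no SYN/FIN, so end_seq == seq and the reset branch above can never fire. An annotated restatement of the check (not a patch):

    /* seq     - first sequence number covered by this segment
     * end_seq - seq + payload length (+1 for SYN, +1 for FIN)
     * rcv_nxt - next sequence number the receiver expects
     *
     * The abort-on-data reset requires a segment that carries new data
     * past rcv_nxt; bare keepalive ACKs (end_seq == seq) never qualify. */
    if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
        after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))
            tcp_reset(sk);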
On Fri, 2019-02-22 at 12:02 -0500, Olga Kornievskaia wrote: > On Fri, Feb 22, 2019 at 10:50 AM Trond Myklebust > <trondmy@hammerspace.com> wrote: > > On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote: > > > On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust > > > <trondmy@hammerspace.com> wrote: > > > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote: > > > > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust < > > > > > trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote: > > > > > > > Hi Olga, > > > > > > > > > > > > > > Do you have a reproducer for this? A number of months > > > > > > > ago I > > > > > > > did > > > > > > > a > > > > > > > significant amount of testing with half-closed > > > > > > > connections, > > > > > > > after > > > > > > > we > > > > > > > had reports of connections stuck in FIN_WAIT2 in some > > > > > > > older > > > > > > > kernels. > > > > > > > What I found was with kernels that had the tcp keepalives > > > > > > > (commit > > > > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only > > > > > > > reproduce > > > > > > > a > > > > > > > hang of a few minutes, after which time the tcp keepalive > > > > > > > code > > > > > > > would > > > > > > > reset the connection. > > > > > > > > > > > > > > That said it was a while ago and something subtle may > > > > > > > have > > > > > > > changed. > > > > > > > Also I'm not not sure if your header implies an > > > > > > > indefinite > > > > > > > hang > > > > > > > or > > > > > > > just > > > > > > > a few minutes. > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia > > > > > > > wrote: > > > > > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > > > > > > > > > > > When server replies with an ACK to client's FIN/ACK, > > > > > > > > client > > > > > > > > ends > > > > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount > > > > > > > > hangs. > > > > > > > > Instead, make sure to close and reset client's socket > > > > > > > > and > > > > > > > > transport > > > > > > > > when transitioned into that state. > > > > > > > > > > Hi Trond, > > > > > > > > > > > So, please do note that we do not want to ignore the > > > > > > FIN_WAIT2 > > > > > > state > > > > > > > > > > But we do ignore the FIN_WAIT2 state. > > > > > > > > We do not. We wait for the server to send a FIN, which is > > > > precisely > > > > the > > > > reason for which FIN_WAIT2 exists. > > > > > > > > > > because it implies that the server has not closed the > > > > > > socket on > > > > > > its > > > > > > side. > > > > > > > > > > That's correct. > > > > > > > > > > > That again means that we cannot re-establish a connection > > > > > > using > > > > > > the same source IP+port to the server, which is problematic > > > > > > for > > > > > > protocols such as NFSv3 which rely on standard duplicate > > > > > > reply > > > > > > cache > > > > > > for correct replay semantics. > > > > > > > > > > that's exactly what's happening that a client is unable to > > > > > establish > > > > > a > > > > > new connection to the server. With the patch, the client does > > > > > an > > > > > RST > > > > > and it re-uses the port and all is well for NFSv3. > > > > > > > > RST is not guaranteed to be delivered to the recipient. That's > > > > why > > > > the > > > > TCP protocol defines FIN: it is a guaranteed to be delivered > > > > because it > > > > is ACKed. 
> > > > > > > > > > This is why we don't just set the TCP_LINGER2 socket option > > > > > > and > > > > > > call > > > > > > sock_release(). The choice to try to wait it out is > > > > > > deliberate > > > > > > because > > > > > > the alternative is that we end up with busy-waiting re- > > > > > > connection > > > > > > attempts. > > > > > > > > > > Why would it busy-wait? In my testing, RST happens and new > > > > > connection > > > > > is established? > > > > > > > > Only if the server has dropped the connection without notifying > > > > the > > > > client. > > > > > > Yes the server dropped the connection without notifying the > > > client > > > (or > > > perhaps something in the middle did it as an attack). Again, I > > > raise > > > this concern for the sake of dealing with this as an attack. I > > > have > > > no > > > intentions of catering to broken servers. If this is not a > > > possible > > > attack, then we don't have to deal with it. > > > > A man in the middle might be able to intercept the FIN from the > > server > > and ACK it, causing the connection to be closed on that server. > > However, as Dave pointed out, why wouldn't the keepalive mechanism > > then > > eventually kick in and close the socket on the client as well? > > The mechanism is already kicked in and got stuck in FIN_WAIT2. NFS > connection was idle, so TCP layer was sending keep-alives. Then it > sent a FIN/ACK to which the server replied with just an ACK. Kernel > notified NFS that we are in FIN_WAIT2 and I believe it is NFS > responsibility to act accordingly. Kernel then keeps sending > "keep-alives" forever. Because of this code: > > case TCP_FIN_WAIT1: > case TCP_FIN_WAIT2: > /* RFC 793 says to queue data in these states, > * RFC 1122 says we MUST send a reset. > * BSD 4.4 also does reset. > */ > if (sk->sk_shutdown & RCV_SHUTDOWN) { > if (TCP_SKB_CB(skb)->end_seq != > TCP_SKB_CB(skb)->seq && > after(TCP_SKB_CB(skb)->end_seq - th->fin, > tp->rcv_nxt)) { > NET_INC_STATS(sock_net(sk), > LINUX_MIB_TCPABORTONDATA); > tcp_reset(sk); << this is never > triggered > return 1; > } > } > > In our case TCP_SKB_CB(skb)->end_seq always equals > TCP_SKB_CB(skb)->seq. (No i don't know the meaning of end_seq and seq > :-/). Right, but if the connection is closed on the server, then it should be sending RST replies to all these keepalives, one of which will presumably eventually reach the client. > > If the FIN is not ACKed, then the server is supposed to keep > > retransmitting it. Until that ACK is received, it cannot close the > > socket without violating the TCP protocol. > > Something in the middle can keep intercepting the the FIN/ACK from > the > server and keep sending an ACK back? Sure, but if it can do that (which would entail being able to guess the TCP segment seq nums) it can also be in a position to generally mess with the TCP connection. Why do we care about that case?
On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.com> wrote: > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote: > > Hi Dave, > > > > A re-producer is a server that sends an ACK to the client's FIN/ACK > > request and does nothing afterwards (I can reproduce it 100% with a > > hacked up server. It was discovered with a "broken" server that > > doesn't fully closes a connection). It leave this client unable to > > connect to this server again indefinitely/forever/reboot required > > kind > > of state. Once it was considered that doing something like that to > > the > > client is a form of an attack (denial-of-server) and thus the kernel > > has a tcp_fin_timeout option after which the kernel will abort the > > connection. However this only applies to the sockets that have been > > closed by the client. This is NOT the case. NFS does not close the > > connection and it ignores kernel's notification of FIN_WAIT2 state. > > > Interesting. I had the same reproducer but eventually an RST would get > sent from the NFS client due to the TCP keepalives. It sounds like > that is not the case anymore. When I did my testing was over a year > and a half ago though. After the keepalives, TCP also sends a FIN/ACK; if the server had properly sent a FIN/ACK back, then keepalive would indeed be sufficient. Can you check whether in your trace the server sent just an ACK in one case but a FIN/ACK in the other? > > One can argue that this is a broken server and we shouldn't bother. > > But this patch is an attempt to argue that the client still should > > care and deal with this condition. However, if the community feels > > that a broken server is a broken server and this form of an attack is > > not interested, this patch can live will be an archive for later or > > never. > > > This isn't IPoIB is it? No, just normal TCP. > Actually, fwiw, looking back I had speculated on changes in this area. > I'm adding you to the CC list of this bug which had some of my musings > on it: > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43 > That bug I ended up closing when we could no longer prove there was any > state where the NFS client could get stuck in FIN_WAIT2 after the > keepalive patch. I can reproduce current behavior with the current upstream code. > It can happen that the server only sends the ACK back to the clients > FIN,ACK so in general the testcase is valid. But then the question is > how long should one wait for the final data and FIN from the server, or > are there ever instances where you shouldn't wait forever? Is there a > way for us to know for sure there is no data left to receive so it's > safe to timeout? No RPCs outstanding? Yep, all good questions which I don't have answers to. I realize, for instance, that a server might send an ACK and then send a FIN/ACK after that. But why is it bad for the client to proactively send an RST (by shutting down the connection in TCP_FIN_WAIT2) if the client was shutting down the connection anyway? > I don't claim to know many of the subtleties here as far as would the > server wait forever in LAST_ACK or do implementations eventually > timeout after some time? Seems like if these last packets get lost > there is likely a bug somewhere (either firewall or TCP stack, etc). > https://tools.ietf.org/html/rfc793#page-22 > > It looks like at least some people are putting timeouts into their > stacks though I'm not sure that's a good idea or not. 
I saw only one other driver in the kernel that handles the TCP_FIN_WAIT2 state (drivers/crypto/chelsio) ...
On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote: > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.co > m> wrote: > > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote: > > > Hi Dave, > > > > > > A re-producer is a server that sends an ACK to the client's > > > FIN/ACK > > > request and does nothing afterwards (I can reproduce it 100% with > > > a > > > hacked up server. It was discovered with a "broken" server that > > > doesn't fully closes a connection). It leave this client unable > > > to > > > connect to this server again indefinitely/forever/reboot required > > > kind > > > of state. Once it was considered that doing something like that > > > to > > > the > > > client is a form of an attack (denial-of-server) and thus the > > > kernel > > > has a tcp_fin_timeout option after which the kernel will abort > > > the > > > connection. However this only applies to the sockets that have > > > been > > > closed by the client. This is NOT the case. NFS does not close > > > the > > > connection and it ignores kernel's notification of FIN_WAIT2 > > > state. > > > > > > > Interesting. I had the same reproducer but eventually an RST would > > get > > sent from the NFS client due to the TCP keepalives. It sounds like > > that is not the case anymore. When I did my testing was over a > > year > > and a half ago though. > > after the keepalives the TCP also sends a FIN/ACK if the server > properly sent a FIN/ACK back, then keep alive will indeed be > sufficient. Can you check if in your trace server in one time sent > just an ACK but in another case sent the FIN/ACK? >

I had two reproducers actually. In both cases the NFS server would never send a FIN.

The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) was about an NFS server crash after being in the half-close state. An NFS4 client without that keepalive patch would hang. This is a valid test case and we check for that now in all releases. Here's the outline:

# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
# Step 3. On NFS server, block outgoing FIN packets from the NFS server
# Step 4. On NFS server, using a systemtap script, delay the NFS4OPEN_CONFIRM response for a few seconds
# Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
# Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
# Step 7. On NFS client, wait for NFS server to reboot
# Step 8. On NFS client, wait for NFS to reconnect the TCP connection
# Step 9. On NFS client, wait for the NFS4 grace period
# Step 10. On NFS client, try a 'ls' of the file created earlier
# Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
# Step 12. On NFS client, ensure all operations have completed
# Step 13. Cleanup

The second test case (after the keepalive patch) was arguably an invalid test, as I used systemtap on the NFS server to never close the socket, so it would never send a FIN. This obviously should never happen, but at the time I wanted to see if the NFS client would recover. 
I did that with this:

probe module("sunrpc").function("svc_xprt_enqueue")
{
        printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
               tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
        $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
}

Here's the full outline for this test:

# This test checks automatic NFS TCP recovery after a specific failure scenario.
# The failure scenario is with an NFSv3 mount that goes idle and the NFS client closes the
# TCP connection, and the final FIN from the NFS server gets lost.
# This can be a tricky scenario since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
# The NFS client should be able to recover the NFS TCP connection automatically without unmount/mount
# cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or any other non-connected
# state.
#
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS3 server export
# Step 3. On NFS server, using systemtap script, force the NFS server to never send a FIN (avoid close)
# Step 4. On NFS client, verify the NFS mount with 'ls' and TCP connection in TCP_ESTABLISHED
# Step 5. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to transition to TIME_WAIT state
# Step 6. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to disappear
# Step 7. Cleanup
#
# PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc)
# FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 indefinitely

> > > > One can argue that this is a broken server and we shouldn't > > > > bother. > > > > But this patch is an attempt to argue that the client still > > > > should > > > > care and deal with this condition. However, if the community > > > > feels > > > > that a broken server is a broken server and this form of an > > > > attack is > > > > not interested, this patch can live will be an archive for > > > > later > > > > or > > > > never. > > > > > > > > > > This isn't IPoIB is it? > > > > No, just normal TCP. > > > > > > Actually, fwiw, looking back I had speculated on changes in this > > > area. > > > I'm adding you to the CC list of this bug which had some of my > > > musings > > > on it: > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43 > > > That bug I ended up closing when we could no longer prove there was > > > any > > > state where the NFS client could get stuck in FIN_WAIT2 after the > > > keepalive patch. > > > > I can reproduce current behavior with the current upstream code. > > > > > > It can happen that the server only sends the ACK back to the > > > clients > > > FIN,ACK so in general the testcase is valid. But then the > > > question > > > is > > > how long should one wait for the final data and FIN from the > > > server, or > > > are there ever instances where you shouldn't wait forever? Is > > > there a > > > way for us to know for sure there is no data left to receive so > > > it's > > > safe to timeout? No RPCs outstanding? > > > > Yep all good questions which I don't have answers to. I realize > > that > > for instance, that a server might send an ACK and then send a > > FIN/ACK > > after that. But why is it bad for the client to proactively send an > > RST (by shutting down the connection in TCP_FIN_WAIT2 if the client > > was shutting down the connection anyway.
As Trond points out elsewhere reconnect / port reuse may be an issue. When I looked at this in the second bug I was almost convinced that there was a problem with the 'close' method in xs_tcp_ops - we needed to be calling xs_close(), but I ran into some problems and I wasn't sure about the test case. > > I don't claim to know many of the subtleties here as far as would > > the > > server wait forever in LAST_ACK or do implementations eventually > > timeout after some time? Seems like if these last packets get lost > > there is likely a bug somewhere (either firewall or TCP stack, > > etc). > > https://tools.ietf.org/html/rfc793#page-22 > > > > It looks like at least some people are putting timeouts into their > > stacks though I'm not sure that's a good idea or not. > > I saw only one other driver in the kernel that does have handling of > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ... One final thought - is it possible that we could have a network that does not support TCP keepalives? In that instance, I'll bet we could get an indefinite hang in FIN_WAIT2 on my first test case (server crashed after the half-close state). It does not look like TCP keepalives have to be implemented and maybe some routers / NATs / firewalls disallow them (due to extra traffic)? https://tools.ietf.org/html/rfc1122#page-101
On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote: > On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote: > > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski < > > dwysocha@redhat.co > > m> wrote: > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote: > > > > Hi Dave, > > > > > > > > A re-producer is a server that sends an ACK to the client's > > > > FIN/ACK > > > > request and does nothing afterwards (I can reproduce it 100% > > > > with > > > > a > > > > hacked up server. It was discovered with a "broken" server that > > > > doesn't fully closes a connection). It leave this client unable > > > > to > > > > connect to this server again indefinitely/forever/reboot > > > > required > > > > kind > > > > of state. Once it was considered that doing something like that > > > > to > > > > the > > > > client is a form of an attack (denial-of-server) and thus the > > > > kernel > > > > has a tcp_fin_timeout option after which the kernel will abort > > > > the > > > > connection. However this only applies to the sockets that have > > > > been > > > > closed by the client. This is NOT the case. NFS does not close > > > > the > > > > connection and it ignores kernel's notification of FIN_WAIT2 > > > > state. > > > > > > > > > > Interesting. I had the same reproducer but eventually an RST > > > would > > > get > > > sent from the NFS client due to the TCP keepalives. It sounds > > > like > > > that is not the case anymore. When I did my testing was over a > > > year > > > and a half ago though. > > > > after the keepalives the TCP also sends a FIN/ACK if the server > > properly sent a FIN/ACK back, then keep alive will indeed be > > sufficient. Can you check if in your trace server in one time sent > > just an ACK but in another case sent the FIN/ACK? > > > > I had two reproducers actually. In both cases the NFS server would > never send a FIN. > > The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) > was > about a NFS server crash after being in the half-close state. An > NFS4 > client without that keepalive patch would hang. > This is a valid test case and we check for that now in all releases. > Here's the outline: > # Outline > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s > 0 port 2049 and host $NFS_SERVER > # Step 2. On NFS client, mount NFS4 server export with > "vers=4,timeo=5" > # Step 3. On NFS server, block outgoing FIN packets from the NFS > server" > # Step 4. On NFS server, using systemtap script, to delay > NFS4OPEN_CONFIRM response for a few seconds > # Step 5. On NFS client, open a file, which should get delayed and > trigger a FIN from the NFS client > # Step 6. On NFS server, after a short delay, crash the server by > 'echo c > /proc/sysrq-trigger' > # Step 7. On NFS client, wait for NFS server to reboot > # Step 8. On NFS client, wait for the NFS to reconnect TCP connection > # Step 9. On NFS client, wait for NFS4 grace period > # Step 10. On NFS client, try a 'ls' of the file created earlier > # Step 11. On NFS client, sleep for $DELAY seconds to allow all > operations to complete > # Step 12. On NFS client, ensure all operations have completed > # Step 13. Cleanup > > > The second test case (after the keepalive patch) was arguably invalid > test as I used systemtap on the NFS server to never close the socket > so > it would never send a FIN. This obviously should never happen but at > the time I wanted to see if the NFS client would recover. 
> I did that with this: > probe module("sunrpc").function("svc_xprt_enqueue") { > printf("%s: %s removing XPT_CLOSE from xprt = %p\n", > tz_ctime(gettimeofday_s()), ppfunc(), $xprt); > $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2); > } > Here's the full outline for this test: > > # This test checks automatic NFS TCP recovery after a specific > failure scenario. > # The failure scenario is with an NFSv3 mount that goes idle and the > NFS client closes the > # TCP connection, and the final FIN from the NFS server gets lost. > # This can be a tricky scenario since the NFS client's TCP may get > stuck in FIN-WAIT-2 indefinitely. > # The NFS client should be able to recover the NFS TCP connection > automatically without unmount/mount > # cycle or reboot, and the TCP connection should not be stuck in FIN- > WAIT-2 or any other non-connected > # state. > # > # Outline > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s > 0 port 2049 and host $NFS_SERVER > # Step 2. On NFS client, mount NFS3 server export > # Step 3. On NFS server, using systemtap script, force the NFS server > to never send a FIN (avoid close) > # Step 4. On NFS client, verify the NFS mount with 'ls' and TCP > connection in TCP_ESTABLISHED > # Step 5. On NFS client, wait up to $time_limit seconds to allow NFS > TCP connection to transition to TIME_WAIT state > # Step 6. On NFS client, wait up to $time_limit seconds to allow NFS > TCP connection to disappear > # Step 7. Cleanup > # > # PASS: The state of NFS's TCP connection is usable by the end of the > test (commands don't hang, etc) > # FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 > indefinitely > > > > > > One can argue that this is a broken server and we shouldn't > > > > bother. > > > > But this patch is an attempt to argue that the client still > > > > should > > > > care and deal with this condition. However, if the community > > > > feels > > > > that a broken server is a broken server and this form of an > > > > attack is > > > > not interested, this patch can live will be an archive for > > > > later > > > > or > > > > never. > > > > > > > > > > This isn't IPoIB is it? > > > > No, just normal TCP. > > > > > > > Actually, fwiw, looking back I had speculated on changes in this > > > area. > > > I'm adding you to the CC list of this bug which had some of my > > > musings > > > on it: > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43 > > > That bug I ended up closing when we could no longer prove there > > > was > > > any > > > state where the NFS client could get stuck in FIN_WAIT2 after the > > > keepalive patch. > > > > I can reproduce current behavior with the current upstream code. > > > > > > > It can happen that the server only sends the ACK back to the > > > clients > > > FIN,ACK so in general the testcase is valid. But then the > > > question > > > is > > > how long should one wait for the final data and FIN from the > > > server, or > > > are there ever instances where you shouldn't wait forever? Is > > > there a > > > way for us to know for sure there is no data left to receive so > > > it's > > > safe to timeout? No RPCs outstanding? > > > > Yep all good questions which I don't have answers to. I realize > > that > > for instance, that a server might send an ACK and then send a > > FIN/ACK > > after that. But why is it bad for the client to proactively send an > > RST (by shutting down the connection in TCP_FIN_WAIT2 if the client > > was shutting down the connection anyway. 
> > > I think you could lose data from the server if you RST and don't wait > but other than that I don't know. We may be splitting hairs here > though if there's no demonstrable harm to the application (NFS). As > Trond points out elsewhere reconnect / port reuse may be an issue. > > When I looked at this in the second bug I was almost convinced that > there was a problem with the 'close' method in xs_tcp_ops - we needed > to be calling xs_close(), but I ran into some problems and I wasn't > sure about the test case. > > > > > I don't claim to know many of the subtleties here as far as would > > > the > > > server wait forever in LAST_ACK or do implementations eventually > > > timeout after some time? Seems like if these last packets get > > > lost > > > there is likely a bug somewhere (either firewall or TCP stack, > > > etc). > > > https://tools.ietf.org/html/rfc793#page-22 > > > > > > It looks like at least some people are putting timeouts into > > > their > > > stacks though I'm not sure that's a good idea or not. > > > > I saw only one other driver in the kernel that does have handling > > of > > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ... > > One final thought - is it possible that we could have a network that > does not support TCP keepalives? In that instance, I'll bet we could > get an indefinite hang in FIN_WAIT2 on my first test case (server > crashed after the half-close state). > > It does not look like TCP keepalives have to be implemented and maybe > some routers / NATs / firewalls disallow them (due to extra traffic)? > https://tools.ietf.org/html/rfc1122#page-101 Again, that's a corner case of a corner case. I've never seen any reports of this occurring. On the other hand, I _have_ seen lots of reports of problems due to the reconnection issues when the server still holds the socket open. Those reports were the main reason for the current design.
On Fri, 2019-02-22 at 20:00 +0000, Trond Myklebust wrote:
> On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> > On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski
> > > <dwysocha@redhat.com> wrote:
> > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > > Hi Dave,
> > > > >
> > > > > A reproducer is a server that sends an ACK to the client's
> > > > > FIN/ACK request and does nothing afterwards (I can reproduce
> > > > > it 100% with a hacked-up server. It was discovered with a
> > > > > "broken" server that doesn't fully close a connection). It
> > > > > leaves this client unable to connect to this server again,
> > > > > an indefinitely/forever/reboot-required kind of state. At one
> > > > > point, doing something like that to a client was considered a
> > > > > form of attack (denial-of-service), and thus the kernel has a
> > > > > tcp_fin_timeout option after which the kernel will abort the
> > > > > connection. However, this only applies to sockets that have
> > > > > been closed by the client. That is NOT the case here: NFS does
> > > > > not close the connection, and it ignores the kernel's
> > > > > notification of the FIN_WAIT2 state.
> > > >
> > > > Interesting. I had the same reproducer, but eventually an RST
> > > > would get sent from the NFS client due to the TCP keepalives.
> > > > It sounds like that is not the case anymore. My testing was over
> > > > a year and a half ago, though.
> > >
> > > After the keepalives, TCP also sends a FIN/ACK. If the server
> > > properly sent a FIN/ACK back, then keepalive will indeed be
> > > sufficient. Can you check whether in your trace the server in one
> > > case sent just an ACK but in the other case sent the FIN/ACK?
> >
> > I had two reproducers actually. In both cases the NFS server would
> > never send a FIN.
> >
> > The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> > was about an NFS server crash after being in the half-close state.
> > An NFS4 client without that keepalive patch would hang. This is a
> > valid test case and we check for that now in all releases. Here's
> > the outline:
> >
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
> > # Step 3. On NFS server, block outgoing FIN packets from the NFS server
> > # Step 4. On NFS server, use a systemtap script to delay the NFS4OPEN_CONFIRM response for a few seconds
> > # Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
> > # Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
> > # Step 7. On NFS client, wait for NFS server to reboot
> > # Step 8. On NFS client, wait for NFS to reconnect the TCP connection
> > # Step 9. On NFS client, wait for the NFS4 grace period
> > # Step 10. On NFS client, try an 'ls' of the file created earlier
> > # Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
> > # Step 12. On NFS client, ensure all operations have completed
> > # Step 13. Cleanup
> >
> > The second test case (after the keepalive patch) was an arguably
> > invalid test, as I used systemtap on the NFS server to never close
> > the socket, so it would never send a FIN. This obviously should
> > never happen, but at the time I wanted to see if the NFS client
> > would recover. I did that with this:
> >
> > probe module("sunrpc").function("svc_xprt_enqueue") {
> >     printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> >            tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> >     $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> > }
> >
> > Here's the full outline for this test:
> >
> > # This test checks automatic NFS TCP recovery after a specific
> > # failure scenario: an NFSv3 mount goes idle, the NFS client closes
> > # the TCP connection, and the final FIN from the NFS server gets
> > # lost. This can be a tricky scenario since the NFS client's TCP may
> > # get stuck in FIN_WAIT2 indefinitely. The NFS client should be able
> > # to recover the NFS TCP connection automatically without an
> > # unmount/mount cycle or reboot, and the TCP connection should not
> > # be stuck in FIN_WAIT2 or any other non-connected state.
> > #
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS3 server export
> > # Step 3. On NFS server, use a systemtap script to force the NFS server to never send a FIN (avoid close)
> > # Step 4. On NFS client, verify the NFS mount with 'ls' and the TCP connection in TCP_ESTABLISHED
> > # Step 5. On NFS client, wait up to $time_limit seconds for the NFS TCP connection to transition to TIME_WAIT state
> > # Step 6. On NFS client, wait up to $time_limit seconds for the NFS TCP connection to disappear
> > # Step 7. Cleanup
> > #
> > # PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc.)
> > # FAIL: The state of the NFS TCP connection remains in FIN_WAIT2 indefinitely
> >
> > > > > One can argue that this is a broken server and we shouldn't
> > > > > bother. But this patch is an attempt to argue that the client
> > > > > still should care and deal with this condition. However, if
> > > > > the community feels that a broken server is a broken server
> > > > > and this form of an attack is not interesting, then this patch
> > > > > can live in an archive for later, or never.
> > > >
> > > > This isn't IPoIB, is it?
> > >
> > > No, just normal TCP.
> > >
> > > > Actually, fwiw, looking back I had speculated on changes in this
> > > > area. I'm adding you to the CC list of this bug, which had some
> > > > of my musings on it:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > > That bug I ended up closing when we could no longer prove there
> > > > was any state where the NFS client could get stuck in FIN_WAIT2
> > > > after the keepalive patch.
> > >
> > > I can reproduce the current behavior with the current upstream
> > > code.
> > >
> > > > It can happen that the server only sends the ACK back to the
> > > > client's FIN,ACK, so in general the test case is valid. But then
> > > > the question is how long should one wait for the final data and
> > > > FIN from the server, and are there ever instances where you
> > > > shouldn't wait forever? Is there a way for us to know for sure
> > > > there is no data left to receive, so it's safe to time out? No
> > > > RPCs outstanding?
> > >
> > > Yep, all good questions which I don't have answers to. I realize,
> > > for instance, that a server might send an ACK and then send a
> > > FIN/ACK after that. But why is it bad for the client to
> > > proactively send an RST (by shutting down the connection in
> > > TCP_FIN_WAIT2) if the client was shutting down the connection
> > > anyway?
> >
> > I think you could lose data from the server if you RST and don't
> > wait, but other than that I don't know. We may be splitting hairs
> > here, though, if there's no demonstrable harm to the application
> > (NFS). As Trond points out elsewhere, reconnect / port reuse may be
> > an issue.
> >
> > When I looked at this in the second bug I was almost convinced that
> > there was a problem with the 'close' method in xs_tcp_ops - we
> > needed to be calling xs_close() - but I ran into some problems and I
> > wasn't sure about the test case.
> >
> > > > I don't claim to know many of the subtleties here, as far as
> > > > whether the server would wait forever in LAST_ACK or whether
> > > > implementations eventually time out. It seems like if these last
> > > > packets get lost there is likely a bug somewhere (either
> > > > firewall or TCP stack, etc).
> > > > https://tools.ietf.org/html/rfc793#page-22
> > > >
> > > > It looks like at least some people are putting timeouts into
> > > > their stacks, though I'm not sure whether that's a good idea.
> > >
> > > I saw only one other driver in the kernel that does have handling
> > > of the TCP_FIN_WAIT2 state (drivers/crypto/chelsio) ...
> >
> > One final thought - is it possible that we could have a network that
> > does not support TCP keepalives? In that instance, I'll bet we could
> > get an indefinite hang in FIN_WAIT2 on my first test case (server
> > crashed after the half-close state).
> >
> > It does not look like TCP keepalives have to be implemented, and
> > maybe some routers / NATs / firewalls disallow them (due to extra
> > traffic)?
> > https://tools.ietf.org/html/rfc1122#page-101
>
> Again, that's a corner case of a corner case. I've never seen any
> reports of this occurring.

I agree, neither have I, and due to the issue about a year and a half
ago I'm confident I would have known about any stuck NFS due to
FIN_WAIT2 problems. Though my experience of course centers around RHEL7
(3.10 based) and RHEL8 beta (4.18 based), so it's not the same bits as
you and others on the list are running.

I thought about the "network that does not support keepalives" some
more, and for various reasons (not sure it exists; even if it does,
that network would be asking for hangs, etc.) I don't think it is an
important case to consider.

> On the other hand, I _have_ seen lots of reports of problems due to
> the reconnection issues when the server still holds the socket open.
> Those reports were the main reason for the current design.

Yes, I agree reconnect issues happen a lot more than I would have
thought, so I would never want any change for a corner case of a corner
case that wasn't truly valid.

Thanks.
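[For readers following the two test cases above, the keepalive
behaviour they hinge on can be approximated in userspace. A rough
sketch - the values are illustrative, not the ones sunrpc uses. With
this armed, a peer that ACKs our FIN but never sends its own is probed
and the connection reset after roughly idle + cnt * intvl seconds,
which matches the few-minutes reset observed in testing rather than an
indefinite hang.]

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	static int arm_keepalive(int fd)
	{
		int on = 1, idle = 60, intvl = 10, cnt = 3;

		/* Probe an idle peer, and reset the connection once
		 * cnt probes at intvl-second spacing go unanswered. */
		if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
		    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
		    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) ||
		    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)))
			return -1;
		return 0;
	}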
On Fri, Feb 22, 2019 at 3:00 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> [...]
>
> Again, that's a corner case of a corner case. I've never seen any
> reports of this occurring.
>
> On the other hand, I _have_ seen lots of reports of problems due to
> the reconnection issues when the server still holds the socket open.
> Those reports were the main reason for the current design.

Continuing on with this thread, here is what I can gather so far.

NFS has an idle timer of its own, which expires and queues
xprt_autoclose():

	INIT_WORK(&xprt->task_cleanup, xprt_autoclose);
	if (xprt_has_timer(xprt))
		timer_setup(&xprt->timer, xprt_init_autodisconnect, 0);
	else
		timer_setup(&xprt->timer, NULL, 0);

xprt_autoclose() is what triggers the TCP client to send
tcp_send_fin(). xprt_autoclose() calls kernel_sock_shutdown(), but I
don't believe that's sufficient to "close the socket". What is needed
is a call to sock_release(), so the socket is orphaned. As a reply to
its FIN/ACK the client gets back an ACK, putting the socket in
FIN_WAIT2; TCP notifies NFS of this, and NFS ignores it instead of
closing the socket as it does for other states like CLOSE_WAIT. The
socket is never closed, so no tcp_fin_timeout is ever applied.

> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
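[The kernel_sock_shutdown()-versus-sock_release() distinction drawn
above maps onto the userspace shutdown()/close() pair. A sketch;
half_close_demo is a made-up name, not sunrpc code, and the claim about
tcp_fin_timeout applying only to orphaned sockets is the one made in
the analysis above.]

	#include <sys/socket.h>
	#include <unistd.h>

	static void half_close_demo(int fd, int orphan)
	{
		/* FIN goes out; an ACK-only peer leaves us in FIN_WAIT2 */
		shutdown(fd, SHUT_RDWR);
		if (orphan)
			/* orphaned: reaped after net.ipv4.tcp_fin_timeout */
			close(fd);
		/* else: the fd stays attached and FIN_WAIT2 persists until
		 * keepalive (if armed) or the peer's FIN ends it;
		 * observe with: ss -tan state fin-wait-2 */
	}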
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 618e9c2..812e5e3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock *sk)
 		clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
 		smp_mb__after_atomic();
 		break;
+	case TCP_FIN_WAIT2:
 	case TCP_CLOSE_WAIT:
 		/* The server initiated a shutdown of the socket */
 		xprt->connect_cookie++;
@@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt *xprt)
 		kernel_sock_shutdown(sock, SHUT_RDWR);
 		trace_rpc_socket_shutdown(xprt, sock);
 		break;
+	case TCP_FIN_WAIT2:
 	case TCP_CLOSE:
 	case TCP_TIME_WAIT:
 		xs_reset_transport(transport);
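[For completeness, a hedged sketch of another way a socket's FIN_WAIT2
lifetime can be bounded: the TCP_LINGER2 socket option overrides
net.ipv4.tcp_fin_timeout for one socket. It only takes effect once the
socket is orphaned, and releasing the socket gives up the reserved
source port - the duplicate-reply-cache trade-off discussed earlier in
the thread. bounded_half_close is an illustrative name, not anything in
sunrpc.]

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static void bounded_half_close(int fd)
	{
		int fin_wait2_secs = 30;	/* per-socket FIN_WAIT2 lifetime */

		setsockopt(fd, IPPROTO_TCP, TCP_LINGER2,
			   &fin_wait2_secs, sizeof(fin_wait2_secs));
		shutdown(fd, SHUT_RDWR);
		close(fd);	/* orphan: FIN_WAIT2 now ends within the bound */
	}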