| Message ID | 20240815113745.6668-1-kerneljasonxing@gmail.com (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to | Netdev Maintainers |
| Series | [v2,net-next] tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process |
From: Jason Xing <kerneljasonxing@gmail.com> Date: Thu, 15 Aug 2024 19:37:45 +0800 > From: Jason Xing <kernelxing@tencent.com> > > We found that one close-wait socket was reset by the other side > which is beyond our expectation, so we have to investigate the > underlying reason. The following experiment is conducted in the > test environment. We limit the port range from 40000 to 40010 > and delay the time to close() after receiving a fin from the > active close side, which can help us easily reproduce like what > happened in production. > > Here are three connections captured by tcpdump: > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 > // a few seconds later, within 60 seconds > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 > // later, very quickly > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > > As we can see, the first flow is reset because: > 1) client starts a new connection, I mean, the second one > 2) client tries to find a suitable port which is a timewait socket > (its state is timewait, substate is fin_wait2) > 3) client occupies that timewait port to send a SYN > 4) server finds a corresponding close-wait socket in ehash table, > then replies with a challenge ack > 5) client sends an RST to terminate this old close-wait socket. > > I don't think the port selection algo can choose a FIN_WAIT2 socket > when we turn on tcp_tw_reuse because on the server side there > remain unread data. If one side haven't call close() yet, we should > not consider it as expendable and treat it at will. > > Even though, sometimes, the server isn't able to call close() as soon > as possible like what we expect, it can not be terminated easily, > especially due to a second unrelated connection happening. > > After this patch, we can see the expected failure if we start a > connection when all the ports are occupied in fin_wait2 state: > "Ncat: Cannot assign requested address." > > Reported-by: Jade Dong <jadedong@tencent.com> > Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Thanks!
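For readers who want to reproduce the scenario described in the changelog above, here is a minimal sketch of the environment setup it mentions (the sysctl paths and values come from the changelog; doing this from C rather than with sysctl(8) is purely illustrative): shrink the ephemeral port range to 40000-40010 and enable net.ipv4.tcp_tw_reuse so that the next connect() quickly lands on a port still held by a FIN_WAIT2 socket.

/*
 * Illustrative sketch of the reproduction environment described in the
 * changelog (normally configured with sysctl(8)); requires root.
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Limit ephemeral ports to 40000-40010, as in the changelog. */
	write_sysctl("/proc/sys/net/ipv4/ip_local_port_range", "40000 40010");
	/* Enable TIME-WAIT reuse so connect() may pick up such a port. */
	write_sysctl("/proc/sys/net/ipv4/tcp_tw_reuse", "1");
	return 0;
}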
Hello Eric, On Thu, Aug 15, 2024 at 7:37 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > We found that one close-wait socket was reset by the other side > which is beyond our expectation, so we have to investigate the > underlying reason. The following experiment is conducted in the > test environment. We limit the port range from 40000 to 40010 > and delay the time to close() after receiving a fin from the > active close side, which can help us easily reproduce like what > happened in production. > > Here are three connections captured by tcpdump: > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 > // a few seconds later, within 60 seconds > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 > // later, very quickly > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > > As we can see, the first flow is reset because: > 1) client starts a new connection, I mean, the second one > 2) client tries to find a suitable port which is a timewait socket > (its state is timewait, substate is fin_wait2) > 3) client occupies that timewait port to send a SYN > 4) server finds a corresponding close-wait socket in ehash table, > then replies with a challenge ack > 5) client sends an RST to terminate this old close-wait socket. > > I don't think the port selection algo can choose a FIN_WAIT2 socket > when we turn on tcp_tw_reuse because on the server side there > remain unread data. If one side haven't call close() yet, we should > not consider it as expendable and treat it at will. > > Even though, sometimes, the server isn't able to call close() as soon > as possible like what we expect, it can not be terminated easily, > especially due to a second unrelated connection happening. > > After this patch, we can see the expected failure if we start a > connection when all the ports are occupied in fin_wait2 state: > "Ncat: Cannot assign requested address." > > Reported-by: Jade Dong <jadedong@tencent.com> > Signed-off-by: Jason Xing <kernelxing@tencent.com> > --- > v2 > Link: https://lore.kernel.org/all/20240814035136.60796-1-kerneljasonxing@gmail.com/ > 1. change from fin_wait2 to timewait test statement, no functional > change (Kuniyuki) > --- > net/ipv4/inet_hashtables.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c > index 9bfcfd016e18..b95215fc15f6 100644 > --- a/net/ipv4/inet_hashtables.c > +++ b/net/ipv4/inet_hashtables.c > @@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row, > continue; > > if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) { > - if (sk2->sk_state == TCP_TIME_WAIT) { > + if (sk2->sk_state == TCP_TIME_WAIT && > + inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT) { > tw = inet_twsk(sk2); > if (sk->sk_protocol == IPPROTO_TCP && > tcp_twsk_unique(sk, sk2, twp)) > -- > 2.37.3 > Not rushing you, so please please don't get me wrong. I'm a little bit worried if this email is submerged in the mailing list. So please also help me review this one :) Thanks, Jason
On 8/15/24 13:37, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> We found that one close-wait socket was reset by the other side
> which is beyond our expectation,

I'm unsure if you should instead reconsider your expectation: what if
the client application does:

    shutdown(fd, SHUT_WR)
    close(fd); // with unread data

?

Thanks,

Paolo
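As an aside, a minimal, self-contained sketch of the pattern Paolo describes (the server address/port, the one-second pause, and the assumption that the peer has already sent data we never read are illustrative, not taken from the thread): because close() runs while unread data sits in the receive queue, the kernel answers with a reset (per RFC 2525, section 2.17), so the peer's close-wait socket is torn down without tcp_tw_reuse being involved at all.

/*
 * Minimal sketch of the client behaviour Paolo describes; assumes a
 * server on 127.0.0.1:9999 that has already queued data we never read.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0 || inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr) != 1)
		return 1;
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	shutdown(fd, SHUT_WR);	/* send our FIN, enter FIN_WAIT1/2 */
	sleep(1);		/* give the peer time to send data we never read */
	close(fd);		/* unread data in the receive queue => RST */
	return 0;
}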
On Tue, Aug 20, 2024 at 1:04 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/15/24 13:37, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > We found that one close-wait socket was reset by the other side
> > which is beyond our expectation,
>
> I'm unsure if you should instead reconsider your expectation: what if
> the client application does:
>
> shutdown(fd, SHUT_WR)
> close(fd); // with unread data
>

Also, I was hoping someone would mention IPv6 at some point.

Jason, instead of a lengthy ChatGPT-style changelog, I would prefer a
packetdrill test exactly showing the issue.
Hello Paolo,

On Tue, Aug 20, 2024 at 7:04 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/15/24 13:37, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > We found that one close-wait socket was reset by the other side
> > which is beyond our expectation,
>
> I'm unsure if you should instead reconsider your expectation: what if
> the client application does:
>
> shutdown(fd, SHUT_WR)
> close(fd); // with unread data

Thanks for the review.

Perhaps I didn't describe the details clearly here. I'm not surprised
that a close-wait socket can be reset the way you describe; that is
normal behaviour. What surprises me is that the close-wait socket is
reset by a completely different flow, only because of the tw-reuse
feature. Should we really be allowed to reuse the port and, as a side
effect, reset the previous connection that is still in the close-wait
state? That is what I'm wondering.

Thanks,
Jason
Hello Eric,

On Tue, Aug 20, 2024 at 8:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Aug 20, 2024 at 1:04 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 8/15/24 13:37, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > We found that one close-wait socket was reset by the other side
> > > which is beyond our expectation,
> >
> > I'm unsure if you should instead reconsider your expectation: what if
> > the client application does:
> >
> > shutdown(fd, SHUT_WR)
> > close(fd); // with unread data
> >
>
> Also, I was hoping someone would mention IPv6 at some point.

Thanks for reminding me. I'll dig into the IPv6 logic.

>
> Jason, instead of a lengthy ChatGPT-style changelog, I would prefer a

LOL, but sorry, I manually control the length which makes it look
strange, I'll adjust it.

> packetdrill test exactly showing the issue.

I will try the packetdrill.

With this patch applied to my local kernel, if some remote sockets delay
calling close(), the client side no longer reuses the fin_wait2 port,
just as when the tw reuse feature is disabled.

Thanks,
Jason
Hello Eric,

On Tue, Aug 20, 2024 at 8:54 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, Aug 20, 2024 at 8:39 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Aug 20, 2024 at 1:04 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 8/15/24 13:37, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > We found that one close-wait socket was reset by the other side
> > > > which is beyond our expectation,
> > >
> > > I'm unsure if you should instead reconsider your expectation: what if
> > > the client application does:
> > >
> > > shutdown(fd, SHUT_WR)
> > > close(fd); // with unread data
> > >
> >
> > Also, I was hoping someone would mention IPv6 at some point.
>
> Thanks for reminding me. I'll dig into the IPv6 logic.
>
> >
> > Jason, instead of a lengthy ChatGPT-style changelog, I would prefer a
>
> LOL, but sorry, I manually control the length which makes it look
> strange, I'll adjust it.
>
> > packetdrill test exactly showing the issue.
>
> I will try the packetdrill.
>

Sorry that I'm not that good at writing such a case, I failed to add
TS option which will be used in tcp_twsk_unique. So I think I need
more time.

In case I don't find enough time to work on the packetdrill script,
I've copied my simple test code below. The code is very simple. Please
take a look at it :)

Client: we can use:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char *argv[])
{
	unsigned short port = 8000;
	char *server_ip = "127.0.0.1";
	int sockfd;

	sockfd = socket(AF_INET, SOCK_STREAM, 0);
	if (sockfd < 0) {
		perror("socket");
		exit(-1);
	}

	struct sockaddr_in server_addr;
	bzero(&server_addr, sizeof(server_addr));
	server_addr.sin_family = AF_INET;
	server_addr.sin_port = htons(port);
	inet_pton(AF_INET, server_ip, &server_addr.sin_addr);

	int err_log = connect(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr));
	if (err_log != 0) {
		perror("connect");
		close(sockfd);
		exit(-1);
	}

	close(sockfd);
	return 0;
}

or use the following command:

nc -vz 127.0.0.1 8000

Server:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char *argv[])
{
	unsigned short port = 8000;
	int sockfd = socket(AF_INET, SOCK_STREAM, 0);

	if (sockfd < 0) {
		perror("socket");
		exit(-1);
	}

	struct sockaddr_in my_addr;
	bzero(&my_addr, sizeof(my_addr));
	my_addr.sin_family = AF_INET;
	my_addr.sin_port = htons(port);
	my_addr.sin_addr.s_addr = htonl(INADDR_ANY);

	int err_log = bind(sockfd, (struct sockaddr *)&my_addr, sizeof(my_addr));
	if (err_log != 0) {
		perror("binding");
		close(sockfd);
		exit(-1);
	}

	err_log = listen(sockfd, 2);
	if (err_log != 0) {
		perror("listen");
		close(sockfd);
		exit(-1);
	}

	while (1) {
		struct sockaddr_in client_addr;
		char cli_ip[INET_ADDRSTRLEN] = "";
		socklen_t cliaddr_len = sizeof(client_addr);
		int connfd;

		connfd = accept(sockfd, (struct sockaddr *)&client_addr, &cliaddr_len);
		if (connfd < 0) {
			perror("accept");
			continue;
		}

		sleep(20); // delay before close() so the socket lingers in close-wait
		close(connfd);
	}

	close(sockfd);
	return 0;
}

This is basic networking code, so the issue can easily be reproduced by
enabling/disabling net.ipv4.tcp_tw_reuse. As I replied to Paolo last
night, maybe I'm wrong, but I still find it hard to accept that a
close-wait socket can receive a reset skb from a new, unrelated flow and
be terminated by it.
After applying this patch, when we hit this case the kernel behaves just
as if net.ipv4.tcp_tw_reuse were set to zero, and the caller sees
"connect: Cannot assign requested address". I wonder if I understand it
the right way?

Thanks,
Jason
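For what it's worth, a minimal sketch of what that failure looks like from the caller's side (the destination address and the error handling are illustrative assumptions, not code from this thread): with every local port in the restricted range pinned by a FIN_WAIT2 socket that is no longer treated as reusable, connect() is expected to fail with EADDRNOTAVAIL.

/*
 * Illustrative sketch only: with the patch applied and all local ports
 * in the shrunken range pinned by FIN_WAIT2 sockets, autobind during
 * connect() finds no usable port and fails with EADDRNOTAVAIL
 * ("Cannot assign requested address").
 */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		if (errno == EADDRNOTAVAIL)
			fprintf(stderr, "no reusable local port: %m\n");
		else
			perror("connect");
	}
	close(fd);
	return 0;
}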
On Wed, Aug 21, 2024 at 12:03 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, Aug 20, 2024 at 8:54 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Tue, Aug 20, 2024 at 8:39 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Aug 20, 2024 at 1:04 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On 8/15/24 13:37, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > We found that one close-wait socket was reset by the other side
> > > > > which is beyond our expectation,
> > > >
> > > > I'm unsure if you should instead reconsider your expectation: what if
> > > > the client application does:
> > > >
> > > > shutdown(fd, SHUT_WR)
> > > > close(fd); // with unread data
> > > >
> > >
> > > Also, I was hoping someone would mention IPv6 at some point.
> >
> > Thanks for reminding me. I'll dig into the IPv6 logic.
> >
> > >
> > > Jason, instead of a lengthy ChatGPT-style changelog, I would prefer a
> >
> > LOL, but sorry, I manually control the length which makes it look
> > strange, I'll adjust it.
> >
> > > packetdrill test exactly showing the issue.
> >
> > I will try the packetdrill.
> >
>
> Sorry that I'm not that good at writing such a case, I failed to add
> TS option which will be used in tcp_twsk_unique. So I think I need
> more time.

The following patch looks better to me, it covers the case where twp == NULL,
and is family independent.

It is also clear it will not impact DCCP without having to think about it.

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd17f25ff288a47fca3ec1881c87d56bd9989709..43a3362e746f331ac64b5e4e6de6878ecd27e115 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -144,6 +144,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 			reuse = 0;
 	}
 
+	if (tw->tw_substate == TCP_FIN_WAIT2)
+		reuse = 0;
 	/* With PAWS, it is safe from the viewpoint
 	   of data integrity. Even without PAWS it is safe provided sequence
 	   spaces do not overlap i.e. at data rates <= 80Mbit/sec.
On Wed, Aug 21, 2024 at 8:39 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Aug 21, 2024 at 12:03 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > Hello Eric, > > > > On Tue, Aug 20, 2024 at 8:54 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > Hello Eric, > > > > > > On Tue, Aug 20, 2024 at 8:39 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Aug 20, 2024 at 1:04 PM Paolo Abeni <pabeni@redhat.com> wrote: > > > > > > > > > > On 8/15/24 13:37, Jason Xing wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > We found that one close-wait socket was reset by the other side > > > > > > which is beyond our expectation, > > > > > > > > > > I'm unsure if you should instead reconsider your expectation: what if > > > > > the client application does: > > > > > > > > > > shutdown(fd, SHUT_WR) > > > > > close(fd); // with unread data > > > > > > > > > > > > > Also, I was hoping someone would mention IPv6 at some point. > > > > > > Thanks for reminding me. I'll dig into the IPv6 logic. > > > > > > > > > > > Jason, instead of a lengthy ChatGPT-style changelog, I would prefer a > > > > > > LOL, but sorry, I manually control the length which makes it look > > > strange, I'll adjust it. > > > > > > > packetdrill test exactly showing the issue. > > > > > > I will try the packetdrill. > > > > > > > Sorry that I'm not that good at writing such a case, I failed to add > > TS option which will be used in tcp_twsk_unique. So I think I need > > more time. > > The following patch looks better to me, it covers the case where twp == NULL, > and is family independent. Right, thanks for your help! Thanks, Jason > It is also clear it will not impact DCCP without having to think about it. > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index fd17f25ff288a47fca3ec1881c87d56bd9989709..43a3362e746f331ac64b5e4e6de6878ecd27e115 > 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -144,6 +144,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock > *sktw, void *twp) > reuse = 0; > } > > + if (tw->tw_substate == TCP_FIN_WAIT2) > + reuse = 0; > /* With PAWS, it is safe from the viewpoint > of data integrity. Even without PAWS it is safe provided sequence > spaces do not overlap i.e. at data rates <= 80Mbit/sec.
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18..b95215fc15f6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 			continue;
 
 		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
-			if (sk2->sk_state == TCP_TIME_WAIT) {
+			if (sk2->sk_state == TCP_TIME_WAIT &&
+			    inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT) {
 				tw = inet_twsk(sk2);
 				if (sk->sk_protocol == IPPROTO_TCP &&
 				    tcp_twsk_unique(sk, sk2, twp))
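For context on why checking sk_state alone is not enough: once the FIN_WAIT2 timer hands a socket over to the timewait infrastructure, Linux represents it with the same inet_timewait_sock used for TIME_WAIT, reporting sk_state == TCP_TIME_WAIT and keeping the real state in tw_substate. A restatement of the patch's condition as a helper, purely as an editorial sketch (this helper does not exist in the kernel):

/* Editorial sketch, not kernel code: restating the condition added by
 * the patch.  A FIN_WAIT2 socket handled by the timewait infrastructure
 * also reports sk_state == TCP_TIME_WAIT, so only tw_substate tells a
 * genuine TIME-WAIT entry (a candidate for reuse) apart from a
 * FIN_WAIT2 one, whose peer may still sit in close-wait with unread data.
 */
static bool tw_is_genuine_time_wait(const struct sock *sk2)
{
	return sk2->sk_state == TCP_TIME_WAIT &&
	       inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT;
}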