Message ID | 20240814035136.60796-1-kerneljasonxing@gmail.com (mailing list archive)
---|---
State | Superseded |
Delegated to | Netdev Maintainers
Series | [net-next] tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Wed, 14 Aug 2024 11:51:36 +0800

> From: Jason Xing <kernelxing@tencent.com>
>
> We found that a close-wait socket was reset by the other side, which
> was beyond our expectation, so we investigated the underlying cause.
> The following experiment was conducted in a test environment: we
> limit the port range to 40000-40010 and delay the call to close()
> after receiving a FIN from the active close side, which lets us
> easily reproduce what happened in production.
>
> Here are three connections captured by tcpdump:
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191
> 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1
> // a few seconds later, within 60 seconds
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
> 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193
> // later, very quickly
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
> 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805
> 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
>
> As we can see, the first flow is reset because:
> 1) the client starts a new connection, i.e. the second one
> 2) while looking for a suitable port, the client picks one occupied
>    by a timewait socket (state TCP_TIME_WAIT, substate TCP_FIN_WAIT2)
> 3) the client reuses that port to send a SYN
> 4) the server finds the corresponding close-wait socket in the ehash
>    table and replies with a challenge ACK
> 5) the client sends an RST, terminating the old close-wait socket.
>
> I don't think the port selection algorithm should be able to choose a
> FIN_WAIT2 socket when tcp_tw_reuse is enabled, because unread data may
> remain on the server side. If one side hasn't called close() yet, we
> should not consider the connection expendable and treat it at will.
>
> Even if the server sometimes cannot call close() as soon as we would
> like, the connection should not be terminated that easily, least of
> all by a second, unrelated connection.
>
> After this patch, we see the expected failure if we start a connection
> while all the ports are occupied by sockets in fin_wait2 state:
> "Ncat: Cannot assign requested address."
>
> Reported-by: Jade Dong <jadedong@tencent.com>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/ipv4/inet_hashtables.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 9bfcfd016e18..6115ee0c5d90 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
>  			continue;
>
>  		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
> -			if (sk2->sk_state == TCP_TIME_WAIT) {
> +			if (sk2->sk_state == TCP_TIME_WAIT &&
> +			    inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {

I prefer comparing explicitly like

	inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT
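For context on why a FIN_WAIT2 socket matches a `TCP_TIME_WAIT` check at all: when a FIN_WAIT2 socket lingers after close(), the kernel replaces it with a lightweight timewait socket and records the real state in `tw_substate`. Below is a minimal kernel-context sketch of the distinction the patch relies on; the helper name is hypothetical and does not appear in the patch:

```c
#include <net/sock.h>			/* struct sock */
#include <net/tcp_states.h>		/* TCP_TIME_WAIT, TCP_FIN_WAIT2 */
#include <net/inet_timewait_sock.h>	/* inet_twsk() */

/* Hypothetical helper mirroring the patched check: a timewait-table
 * entry is only a candidate for port reuse in connect() if it is a
 * *real* TIME_WAIT socket. Lingering FIN_WAIT2 sockets are also stored
 * as timewait sockets (sk_state == TCP_TIME_WAIT) with the true state
 * kept in tw_substate, so sk_state alone cannot tell the two apart.
 */
static bool twsk_really_time_wait(const struct sock *sk2)
{
	return sk2->sk_state == TCP_TIME_WAIT &&
	       inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2;
}
```

With the patch applied, __inet_check_established() effectively applies this predicate before handing the port to tcp_twsk_unique() for reuse.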
Hello Kuniyuki,

On Thu, Aug 15, 2024 at 3:39 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Wed, 14 Aug 2024 11:51:36 +0800
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > [...]
> >
> >  		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
> > -			if (sk2->sk_state == TCP_TIME_WAIT) {
> > +			if (sk2->sk_state == TCP_TIME_WAIT &&
> > +			    inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {
>
> I prefer comparing explicitly like
>
>	inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT

Thanks, I will adjust in the v2 patch soon.

Thanks,
Jason
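Assuming v2 changes nothing but the comparison, the hunk would presumably end up as sketched below (Kuniyuki's suggestion applied; this is not the actual v2 posting). Since `tw_substate` only ever holds `TCP_TIME_WAIT` or `TCP_FIN_WAIT2` for TCP, the two forms accept the same sockets; the `==` form simply states the intended condition positively:

```diff
 		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
-			if (sk2->sk_state == TCP_TIME_WAIT) {
+			if (sk2->sk_state == TCP_TIME_WAIT &&
+			    inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT) {
 				tw = inet_twsk(sk2);
```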
```diff
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18..6115ee0c5d90 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 			continue;
 
 		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
-			if (sk2->sk_state == TCP_TIME_WAIT) {
+			if (sk2->sk_state == TCP_TIME_WAIT &&
+			    inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {
 				tw = inet_twsk(sk2);
 				if (sk->sk_protocol == IPPROTO_TCP &&
 				    tcp_twsk_unique(sk, sk2, twp))
```