[net-next] tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process

Message ID 20240814035136.60796-1-kerneljasonxing@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net-next] tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 29 this patch: 29
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest warning net-next-2024-08-15--00-00 (tests: 707)

Commit Message

Jason Xing Aug. 14, 2024, 3:51 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

We found that a CLOSE_WAIT socket was unexpectedly reset by the
other side, so we investigated the underlying reason. The following
experiment was conducted in a test environment: we limit the local
port range to 40000-40010 and delay the call to close() after
receiving a FIN from the active-close side, which makes it easy to
reproduce what happened in production.
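
To make the setup concrete, below is a minimal sketch of the
delayed-close server used for reproduction (the 60-second delay and
the omitted error handling are illustrative assumptions, not the
original test code); on the client, net.ipv4.ip_local_port_range is
shrunk to "40000 40010" so the same local ports get picked again
quickly:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int ls = socket(AF_INET, SOCK_STREAM, 0);
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};

	setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	bind(ls, (struct sockaddr *)&addr, sizeof(addr));
	listen(ls, 8);

	for (;;) {
		int cs = accept(ls, NULL, NULL);

		/* Do not read or close() yet: after the client's FIN this
		 * socket stays in CLOSE_WAIT while the client side sits in
		 * FIN_WAIT2. */
		sleep(60);
		close(cs);
	}
}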

Here are three connections captured by tcpdump:
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1
// a few seconds later, within 60 seconds
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2
127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193
// later, very quickly
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1

As we can see, the first connection is reset because:
1) the client starts a new (i.e. the second) connection
2) while looking for a suitable port, the client picks one owned by a
   timewait socket (state TIME_WAIT, substate FIN_WAIT2)
3) the client reuses that port to send a SYN
4) the server finds the corresponding CLOSE_WAIT socket in the ehash
   table and replies with a challenge ACK
5) the client sends an RST, terminating the old CLOSE_WAIT socket.

I don't think the port selection algorithm should be able to pick a
FIN_WAIT2 socket when tcp_tw_reuse is enabled, because unread data may
still remain on the server side. If one side hasn't called close()
yet, we should not consider the connection expendable and treat it at
will.

Even if the server cannot call close() as promptly as we would like,
the connection should not be terminated this easily, least of all by a
second, unrelated connection.
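
For readers less familiar with the timewait code: once the client has
called close() and reached FIN_WAIT2, the full socket is (with a
non-negative linger2/tcp_fin_timeout) typically replaced by an
inet_timewait_sock, so sk_state alone cannot tell it apart from a
genuine TIME_WAIT socket. A simplified sketch of the relevant
assignments (paraphrasing inet_twsk_alloc(), not a verbatim quote):

	tw->tw_state    = TCP_TIME_WAIT;  /* what sk2->sk_state reports */
	tw->tw_substate = state;          /* TCP_TIME_WAIT or TCP_FIN_WAIT2 */

This is why the check changed below also has to look at tw_substate
before treating the socket as reusable.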

After this patch, starting a connection while all the ports are
occupied by FIN_WAIT2 sockets fails as expected:
"Ncat: Cannot assign requested address."

Reported-by: Jade Dong <jadedong@tencent.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/ipv4/inet_hashtables.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Kuniyuki Iwashima Aug. 14, 2024, 7:38 p.m. UTC | #1
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Wed, 14 Aug 2024 11:51:36 +0800
> From: Jason Xing <kernelxing@tencent.com>
> 
> [...]
> 
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 9bfcfd016e18..6115ee0c5d90 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
>  			continue;
>  
>  		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
> -			if (sk2->sk_state == TCP_TIME_WAIT) {
> +			if (sk2->sk_state == TCP_TIME_WAIT &&
> +			    inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {

I prefer comparing explicitly like

  inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT
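
(For clarity, the condition in a v2 following this suggestion would
presumably read as below; this is a sketch of the suggested change,
not the posted v2:)

	if (sk2->sk_state == TCP_TIME_WAIT &&
	    inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT) {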
Jason Xing Aug. 15, 2024, 4:29 a.m. UTC | #2
Hello Kuniyuki,

On Thu, Aug 15, 2024 at 3:39 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Wed, 14 Aug 2024 11:51:36 +0800
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > [...]
> >
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 9bfcfd016e18..6115ee0c5d90 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
> >                       continue;
> >
> >               if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
> > -                     if (sk2->sk_state == TCP_TIME_WAIT) {
> > +                     if (sk2->sk_state == TCP_TIME_WAIT &&
> > +                         inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {
>
> I prefer comparing explicitly like
>
>   inet_twsk(sk2)->tw_substate == TCP_TIME_WAIT

Thanks, I will adjust in the v2 patch soon.

Thanks,
Jason

Patch

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18..6115ee0c5d90 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -563,7 +563,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 			continue;
 
 		if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
-			if (sk2->sk_state == TCP_TIME_WAIT) {
+			if (sk2->sk_state == TCP_TIME_WAIT &&
+			    inet_twsk(sk2)->tw_substate != TCP_FIN_WAIT2) {
 				tw = inet_twsk(sk2);
 				if (sk->sk_protocol == IPPROTO_TCP &&
 				    tcp_twsk_unique(sk, sk2, twp))