Message ID | 20240821153325.3204-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v3,net-next] tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process | expand |
On Wed, Aug 21, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > We found that one close-wait socket was reset by the other side > due to a new connection reusing the same port which is beyond our > expectation, so we have to investigate the underlying reason. > > The following experiment is conducted in the test environment. We > limit the port range from 40000 to 40010 and delay the time to close() > after receiving a fin from the active close side, which can help us > easily reproduce like what happened in production. > > Here are three connections captured by tcpdump: > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 > // a few seconds later, within 60 seconds > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 > // later, very quickly > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > > As we can see, the first flow is reset because: > 1) client starts a new connection, I mean, the second one > 2) client tries to find a suitable port which is a timewait socket > (its state is timewait, substate is fin_wait2) > 3) client occupies that timewait port to send a SYN > 4) server finds a corresponding close-wait socket in ehash table, > then replies with a challenge ack > 5) client sends an RST to terminate this old close-wait socket. > > I don't think the port selection algo can choose a FIN_WAIT2 socket > when we turn on tcp_tw_reuse because on the server side there > remain unread data. In some cases, if one side haven't call close() yet, > we should not consider it as expendable and treat it at will. > > Even though, sometimes, the server isn't able to call close() as soon > as possible like what we expect, it can not be terminated easily, > especially due to a second unrelated connection happening. > > After this patch, we can see the expected failure if we start a > connection when all the ports are occupied in fin_wait2 state: > "Ncat: Cannot assign requested address." > > Reported-by: Jade Dong <jadedong@tencent.com> > Signed-off-by: Jason Xing <kernelxing@tencent.com> > --- > v3 > Link: https://lore.kernel.org/all/20240815113745.6668-1-kerneljasonxing@gmail.com/ > 1. take the ipv6 case into consideration. (Eric) > > v2 > Link: https://lore.kernel.org/all/20240814035136.60796-1-kerneljasonxing@gmail.com/ > 1. change from fin_wait2 to timewait test statement, no functional > change (Kuniyuki) > --- > net/ipv4/tcp_ipv4.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index fd17f25ff288..b37c70d292bc 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -144,6 +144,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) > reuse = 0; > } > > + if (tw->tw_substate == TCP_FIN_WAIT2) > + reuse = 0; > + sysctl_tcp_tw_reuse default value being 2, I would suggest doing this test earlier, to avoid unneeded work. diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c2860480099f216d69fc570efdb991d2304be785..9af18d0293cd6655faf4eeb60ff3d41ce94ae843 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -118,6 +118,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) struct tcp_sock *tp = tcp_sk(sk); int ts_recent_stamp; + if (tw->tw_substate == TCP_FIN_WAIT2) + reuse = 0; + if (reuse == 2) { /* Still does not detect *everything* that goes through * lo, since we require a loopback src or dst address
On Wed, Aug 21, 2024 at 11:47 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Aug 21, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > We found that one close-wait socket was reset by the other side > > due to a new connection reusing the same port which is beyond our > > expectation, so we have to investigate the underlying reason. > > > > The following experiment is conducted in the test environment. We > > limit the port range from 40000 to 40010 and delay the time to close() > > after receiving a fin from the active close side, which can help us > > easily reproduce like what happened in production. > > > > Here are three connections captured by tcpdump: > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 > > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 > > // a few seconds later, within 60 seconds > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 > > // later, very quickly > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 > > 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 > > 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 > > > > As we can see, the first flow is reset because: > > 1) client starts a new connection, I mean, the second one > > 2) client tries to find a suitable port which is a timewait socket > > (its state is timewait, substate is fin_wait2) > > 3) client occupies that timewait port to send a SYN > > 4) server finds a corresponding close-wait socket in ehash table, > > then replies with a challenge ack > > 5) client sends an RST to terminate this old close-wait socket. > > > > I don't think the port selection algo can choose a FIN_WAIT2 socket > > when we turn on tcp_tw_reuse because on the server side there > > remain unread data. In some cases, if one side haven't call close() yet, > > we should not consider it as expendable and treat it at will. > > > > Even though, sometimes, the server isn't able to call close() as soon > > as possible like what we expect, it can not be terminated easily, > > especially due to a second unrelated connection happening. > > > > After this patch, we can see the expected failure if we start a > > connection when all the ports are occupied in fin_wait2 state: > > "Ncat: Cannot assign requested address." > > > > Reported-by: Jade Dong <jadedong@tencent.com> > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > --- > > v3 > > Link: https://lore.kernel.org/all/20240815113745.6668-1-kerneljasonxing@gmail.com/ > > 1. take the ipv6 case into consideration. (Eric) > > > > v2 > > Link: https://lore.kernel.org/all/20240814035136.60796-1-kerneljasonxing@gmail.com/ > > 1. change from fin_wait2 to timewait test statement, no functional > > change (Kuniyuki) > > --- > > net/ipv4/tcp_ipv4.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index fd17f25ff288..b37c70d292bc 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -144,6 +144,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) > > reuse = 0; > > } > > > > + if (tw->tw_substate == TCP_FIN_WAIT2) > > + reuse = 0; > > + > > sysctl_tcp_tw_reuse default value being 2, I would suggest doing this > test earlier, > to avoid unneeded work. Thanks. I should have thought of that. I will submit it ~24 hours later. Thanks, Jason > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index c2860480099f216d69fc570efdb991d2304be785..9af18d0293cd6655faf4eeb60ff3d41ce94ae843 > 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -118,6 +118,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock > *sktw, void *twp) > struct tcp_sock *tp = tcp_sk(sk); > int ts_recent_stamp; > > + if (tw->tw_substate == TCP_FIN_WAIT2) > + reuse = 0; > + > if (reuse == 2) { > /* Still does not detect *everything* that goes through > * lo, since we require a loopback src or dst address
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fd17f25ff288..b37c70d292bc 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -144,6 +144,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) reuse = 0; } + if (tw->tw_substate == TCP_FIN_WAIT2) + reuse = 0; + /* With PAWS, it is safe from the viewpoint of data integrity. Even without PAWS it is safe provided sequence spaces do not overlap i.e. at data rates <= 80Mbit/sec.