Message ID | 20240812105315.440718-1-kuro@kuroa.me (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,v2] tcp: fix forever orphan socket caused by tcp_abort | expand |
On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > environment. This patch come from the investigation. > > Previously tcp_abort only sends out reset and calls tcp_done when the > socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only > purging the write queue, but not close the socket and left it to the > timer. > > While purging the write queue, tp->packets_out and sk->sk_write_queue > is cleared along the way. However tcp_retransmit_timer have early > return based on !tp->packets_out and tcp_probe_timer have early > return based on !sk->sk_write_queue. > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > and socket not being killed by the timers, converting a zero-windowed > orphan into a forever orphan. > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > reset to peer and close the socket accordingly. Preventing the > timer-less orphan from happening. > > According to Lorenzo's email in the v1 thread, the check was there to > prevent force-closing the same socket twice. That situation is handled > by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is > already closed. > > The -ENOENT code comes from the associate patch Lorenzo made for > iproute2-ss; link attached below. > > Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/ > Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.") > Signed-off-by: Xueming Feng <kuro@kuroa.me> You seem to have forgotten to CC Jakub and Paolo which are also networking maintainers. > --- > net/ipv4/tcp.c | 18 +++++++++++------- > 1 file changed, 11 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index e03a342c9162..831a18dc7aa6 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err) > /* Don't race with userspace socket closes such as tcp_close. */ > lock_sock(sk); > > + /* Avoid closing the same socket twice. */ > + if (sk->sk_state == TCP_CLOSE) { > + if (!has_current_bpf_ctx()) > + release_sock(sk); > + return -ENOENT; > + } > + > if (sk->sk_state == TCP_LISTEN) { > tcp_set_state(sk, TCP_CLOSE); > inet_csk_listen_stop(sk); > @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err) > local_bh_disable(); > bh_lock_sock(sk); > > - if (!sock_flag(sk, SOCK_DEAD)) { > - if (tcp_need_reset(sk->sk_state)) > - tcp_send_active_reset(sk, GFP_ATOMIC, > - SK_RST_REASON_NOT_SPECIFIED); > - tcp_done_with_error(sk, err); > - } > + if (tcp_need_reset(sk->sk_state)) > + tcp_send_active_reset(sk, GFP_ATOMIC, > + SK_RST_REASON_NOT_SPECIFIED); Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier. Please feel free to add: Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> in your next submission. Thanks, Jason > + tcp_done_with_error(sk, err); > > bh_unlock_sock(sk); > local_bh_enable(); > - tcp_write_queue_purge(sk); > if (!has_current_bpf_ctx()) > release_sock(sk); > return 0; > --
On Mon, Aug 12, 2024 at 10:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > > environment. This patch come from the investigation. > > > > Previously tcp_abort only sends out reset and calls tcp_done when the > > socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only > > purging the write queue, but not close the socket and left it to the > > timer. > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue > > is cleared along the way. However tcp_retransmit_timer have early > > return based on !tp->packets_out and tcp_probe_timer have early > > return based on !sk->sk_write_queue. > > > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > > and socket not being killed by the timers, converting a zero-windowed > > orphan into a forever orphan. > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > reset to peer and close the socket accordingly. Preventing the > > timer-less orphan from happening. > > > > According to Lorenzo's email in the v1 thread, the check was there to > > prevent force-closing the same socket twice. That situation is handled > > by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is > > already closed. > > > > The -ENOENT code comes from the associate patch Lorenzo made for > > iproute2-ss; link attached below. > > > > Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/ > > Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.") > > Signed-off-by: Xueming Feng <kuro@kuroa.me> > > You seem to have forgotten to CC Jakub and Paolo which are also > networking maintainers. > > > --- > > net/ipv4/tcp.c | 18 +++++++++++------- > > 1 file changed, 11 insertions(+), 7 deletions(-) > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index e03a342c9162..831a18dc7aa6 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err) > > /* Don't race with userspace socket closes such as tcp_close. */ > > lock_sock(sk); > > > > + /* Avoid closing the same socket twice. */ > > + if (sk->sk_state == TCP_CLOSE) { > > + if (!has_current_bpf_ctx()) > > + release_sock(sk); > > + return -ENOENT; > > + } > > + > > if (sk->sk_state == TCP_LISTEN) { > > tcp_set_state(sk, TCP_CLOSE); > > inet_csk_listen_stop(sk); > > @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err) > > local_bh_disable(); > > bh_lock_sock(sk); > > > > - if (!sock_flag(sk, SOCK_DEAD)) { > > - if (tcp_need_reset(sk->sk_state)) > > - tcp_send_active_reset(sk, GFP_ATOMIC, > > - SK_RST_REASON_NOT_SPECIFIED); > > - tcp_done_with_error(sk, err); > > - } > > + if (tcp_need_reset(sk->sk_state)) > > + tcp_send_active_reset(sk, GFP_ATOMIC, > > + SK_RST_REASON_NOT_SPECIFIED); > > Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier. Sorry, my fault. It's the net tree which doesn't have the commit edefba66d92 yet, so you don't need to modify this. As I said, the patch looks good to me. Let Eric make the decision finally. Thanks! Thanks, Jason
On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > The -ENOENT code comes from the associate patch Lorenzo made for > iproute2-ss; link attached below. ENOENT does seem reasonable. It's the same thing that would happen if userspace passed in a nonexistent cookie (we have a test for that). I'd guess this could happen if userspace was trying to destroy a socket but it lost the race against the process owning a socket closing it? > bh_unlock_sock(sk); > local_bh_enable(); > - tcp_write_queue_purge(sk); Is this not necessary in any other cases? What if there is retransmitted data, shouldn't that be cleared? Other than that, I have run the Android tests on this patch and they all passed other than the test that checks that closing FIN_WAIT1 sockets can't be closed. That's expected to fail because it checking that the kernel does not support what you are trying to support. I uploaded a CL to fix that: https://r.android.com/3217682 . Tested-By: Lorenzo Colitti <lorenzo@google.com>
On Wed, Aug 14, 2024 at 3:35 PM Lorenzo Colitti <lorenzo@google.com> wrote: > > On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > The -ENOENT code comes from the associate patch Lorenzo made for > > iproute2-ss; link attached below. > > ENOENT does seem reasonable. It's the same thing that would happen if > userspace passed in a nonexistent cookie (we have a test for that). > I'd guess this could happen if userspace was trying to destroy a > socket but it lost the race against the process owning a socket > closing it? > > > bh_unlock_sock(sk); > > local_bh_enable(); > > - tcp_write_queue_purge(sk); > > Is this not necessary in any other cases? What if there is > retransmitted data, shouldn't that be cleared? This line is duplicated, please see tcp_done_with_error()->tcp_write_queue_purge(). Thanks, Jason
On Mon, Aug 14, 2024 at 7:34 AM Lorenzo Colitti <lorenzo@google.com> wrote: > On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > The -ENOENT code comes from the associate patch Lorenzo made for > > iproute2-ss; link attached below. > > ENOENT does seem reasonable. It's the same thing that would happen if > userspace passed in a nonexistent cookie (we have a test for that). In the latest TCP RFC 9293, section 3.10.5 on the ABORT CALL, it mentions that an "error: connection does not exist" to be returned for a CLOSED STATE. I noticed this while verifying whether a reset in the FIN-WAIT STATE is legal, which it is. > I'd guess this could happen if userspace was trying to destroy a > socket but it lost the race against the process owning a socket > closing it? Yes, that’s exactly the scenario I'm addressing. I tested this locally by calling tcp_diag twice with the same socket pointer. > > > bh_unlock_sock(sk); > > local_bh_enable(); > > - tcp_write_queue_purge(sk); > > Is this not necessary in any other cases? What if there is > retransmitted data, shouldn't that be cleared? The tcp_write_queue_purge() function is indeed invoked within tcp_done_with_error(). In this patch, the tcp_done_with_error is elevated to the same logical level where tcp_write_queue_purge would typically be called. The difference is that the purge happens just before tcp_done. So the queue should still be cleared in other scenarios as well.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e03a342c9162..831a18dc7aa6 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err) /* Don't race with userspace socket closes such as tcp_close. */ lock_sock(sk); + /* Avoid closing the same socket twice. */ + if (sk->sk_state == TCP_CLOSE) { + if (!has_current_bpf_ctx()) + release_sock(sk); + return -ENOENT; + } + if (sk->sk_state == TCP_LISTEN) { tcp_set_state(sk, TCP_CLOSE); inet_csk_listen_stop(sk); @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err) local_bh_disable(); bh_lock_sock(sk); - if (!sock_flag(sk, SOCK_DEAD)) { - if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); - tcp_done_with_error(sk, err); - } + if (tcp_need_reset(sk->sk_state)) + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); + tcp_done_with_error(sk, err); bh_unlock_sock(sk); local_bh_enable(); - tcp_write_queue_purge(sk); if (!has_current_bpf_ctx()) release_sock(sk); return 0;
We have some problem closing zero-window fin-wait-1 tcp sockets in our environment. This patch come from the investigation. Previously tcp_abort only sends out reset and calls tcp_done when the socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only purging the write queue, but not close the socket and left it to the timer. While purging the write queue, tp->packets_out and sk->sk_write_queue is cleared along the way. However tcp_retransmit_timer have early return based on !tp->packets_out and tcp_probe_timer have early return based on !sk->sk_write_queue. This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched and socket not being killed by the timers, converting a zero-windowed orphan into a forever orphan. This patch removes the SOCK_DEAD check in tcp_abort, making it send reset to peer and close the socket accordingly. Preventing the timer-less orphan from happening. According to Lorenzo's email in the v1 thread, the check was there to prevent force-closing the same socket twice. That situation is handled by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is already closed. The -ENOENT code comes from the associate patch Lorenzo made for iproute2-ss; link attached below. Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/ Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.") Signed-off-by: Xueming Feng <kuro@kuroa.me> --- net/ipv4/tcp.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) --