diff mbox series

[net,v2] tcp: fix forever orphan socket caused by tcp_abort

Message ID 20240812105315.440718-1-kuro@kuroa.me (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] tcp: fix forever orphan socket caused by tcp_abort | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 3 maintainers not CCed: kuba@kernel.org bpf@vger.kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 29 this patch: 29
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 33 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Xueming Feng Aug. 12, 2024, 10:53 a.m. UTC
We have some problem closing zero-window fin-wait-1 tcp sockets in our 
environment. This patch come from the investigation.

Previously tcp_abort only sends out reset and calls tcp_done when the 
socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only 
purging the write queue, but not close the socket and left it to the 
timer.

While purging the write queue, tp->packets_out and sk->sk_write_queue 
is cleared along the way. However tcp_retransmit_timer have early 
return based on !tp->packets_out and tcp_probe_timer have early 
return based on !sk->sk_write_queue.

This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched 
and socket not being killed by the timers, converting a zero-windowed
orphan into a forever orphan.

This patch removes the SOCK_DEAD check in tcp_abort, making it send 
reset to peer and close the socket accordingly. Preventing the 
timer-less orphan from happening.

According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
already closed.

The -ENOENT code comes from the associate patch Lorenzo made for 
iproute2-ss; link attached below.

Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
---
 net/ipv4/tcp.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

--

Comments

Jason Xing Aug. 12, 2024, 2 p.m. UTC | #1
On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote:
>
> We have some problem closing zero-window fin-wait-1 tcp sockets in our
> environment. This patch come from the investigation.
>
> Previously tcp_abort only sends out reset and calls tcp_done when the
> socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
> purging the write queue, but not close the socket and left it to the
> timer.
>
> While purging the write queue, tp->packets_out and sk->sk_write_queue
> is cleared along the way. However tcp_retransmit_timer have early
> return based on !tp->packets_out and tcp_probe_timer have early
> return based on !sk->sk_write_queue.
>
> This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
> and socket not being killed by the timers, converting a zero-windowed
> orphan into a forever orphan.
>
> This patch removes the SOCK_DEAD check in tcp_abort, making it send
> reset to peer and close the socket accordingly. Preventing the
> timer-less orphan from happening.
>
> According to Lorenzo's email in the v1 thread, the check was there to
> prevent force-closing the same socket twice. That situation is handled
> by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
> already closed.
>
> The -ENOENT code comes from the associate patch Lorenzo made for
> iproute2-ss; link attached below.
>
> Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
> Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
> Signed-off-by: Xueming Feng <kuro@kuroa.me>

You seem to have forgotten to CC Jakub and Paolo which are also
networking maintainers.

> ---
>  net/ipv4/tcp.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162..831a18dc7aa6 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
>                 /* Don't race with userspace socket closes such as tcp_close. */
>                 lock_sock(sk);
>
> +       /* Avoid closing the same socket twice. */
> +       if (sk->sk_state == TCP_CLOSE) {
> +               if (!has_current_bpf_ctx())
> +                       release_sock(sk);
> +               return -ENOENT;
> +       }
> +
>         if (sk->sk_state == TCP_LISTEN) {
>                 tcp_set_state(sk, TCP_CLOSE);
>                 inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
>
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> -                                             SK_RST_REASON_NOT_SPECIFIED);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC,
> +                                     SK_RST_REASON_NOT_SPECIFIED);

Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier.

Please feel free to add:
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
in your next submission.

Thanks,
Jason

> +       tcp_done_with_error(sk, err);
>
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         if (!has_current_bpf_ctx())
>                 release_sock(sk);
>         return 0;
> --
Jason Xing Aug. 12, 2024, 10:15 p.m. UTC | #2
On Mon, Aug 12, 2024 at 10:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote:
> >
> > We have some problem closing zero-window fin-wait-1 tcp sockets in our
> > environment. This patch come from the investigation.
> >
> > Previously tcp_abort only sends out reset and calls tcp_done when the
> > socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
> > purging the write queue, but not close the socket and left it to the
> > timer.
> >
> > While purging the write queue, tp->packets_out and sk->sk_write_queue
> > is cleared along the way. However tcp_retransmit_timer have early
> > return based on !tp->packets_out and tcp_probe_timer have early
> > return based on !sk->sk_write_queue.
> >
> > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
> > and socket not being killed by the timers, converting a zero-windowed
> > orphan into a forever orphan.
> >
> > This patch removes the SOCK_DEAD check in tcp_abort, making it send
> > reset to peer and close the socket accordingly. Preventing the
> > timer-less orphan from happening.
> >
> > According to Lorenzo's email in the v1 thread, the check was there to
> > prevent force-closing the same socket twice. That situation is handled
> > by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
> > already closed.
> >
> > The -ENOENT code comes from the associate patch Lorenzo made for
> > iproute2-ss; link attached below.
> >
> > Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
> > Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
> > Signed-off-by: Xueming Feng <kuro@kuroa.me>
>
> You seem to have forgotten to CC Jakub and Paolo which are also
> networking maintainers.
>
> > ---
> >  net/ipv4/tcp.c | 18 +++++++++++-------
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e03a342c9162..831a18dc7aa6 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
> >                 /* Don't race with userspace socket closes such as tcp_close. */
> >                 lock_sock(sk);
> >
> > +       /* Avoid closing the same socket twice. */
> > +       if (sk->sk_state == TCP_CLOSE) {
> > +               if (!has_current_bpf_ctx())
> > +                       release_sock(sk);
> > +               return -ENOENT;
> > +       }
> > +
> >         if (sk->sk_state == TCP_LISTEN) {
> >                 tcp_set_state(sk, TCP_CLOSE);
> >                 inet_csk_listen_stop(sk);
> > @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
> >         local_bh_disable();
> >         bh_lock_sock(sk);
> >
> > -       if (!sock_flag(sk, SOCK_DEAD)) {
> > -               if (tcp_need_reset(sk->sk_state))
> > -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> > -                                             SK_RST_REASON_NOT_SPECIFIED);
> > -               tcp_done_with_error(sk, err);
> > -       }
> > +       if (tcp_need_reset(sk->sk_state))
> > +               tcp_send_active_reset(sk, GFP_ATOMIC,
> > +                                     SK_RST_REASON_NOT_SPECIFIED);
>
> Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier.

Sorry, my fault. It's the net tree which doesn't have the commit
edefba66d92 yet, so you don't need to modify this.

As I said, the patch looks good to me. Let Eric make the decision
finally. Thanks!

Thanks,
Jason
Lorenzo Colitti Aug. 14, 2024, 7:34 a.m. UTC | #3
On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote:
> The -ENOENT code comes from the associate patch Lorenzo made for
> iproute2-ss; link attached below.

ENOENT does seem reasonable. It's the same thing that would happen if
userspace passed in a nonexistent cookie (we have a test for that).
I'd guess this could happen if userspace was trying to destroy a
socket but it lost the race against the process owning a socket
closing it?

>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);

Is this not necessary in any other cases? What if there is
retransmitted data, shouldn't that be cleared?

Other than that, I have run the Android tests on this patch and they
all passed other than the test that checks that closing FIN_WAIT1
sockets can't be closed. That's expected to fail because it checking
that the kernel does not support what you are trying to support. I
uploaded a CL to fix that: https://r.android.com/3217682 .

Tested-By: Lorenzo Colitti <lorenzo@google.com>
Jason Xing Aug. 14, 2024, 8:38 a.m. UTC | #4
On Wed, Aug 14, 2024 at 3:35 PM Lorenzo Colitti <lorenzo@google.com> wrote:
>
> On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote:
> > The -ENOENT code comes from the associate patch Lorenzo made for
> > iproute2-ss; link attached below.
>
> ENOENT does seem reasonable. It's the same thing that would happen if
> userspace passed in a nonexistent cookie (we have a test for that).
> I'd guess this could happen if userspace was trying to destroy a
> socket but it lost the race against the process owning a socket
> closing it?
>
> >         bh_unlock_sock(sk);
> >         local_bh_enable();
> > -       tcp_write_queue_purge(sk);
>
> Is this not necessary in any other cases? What if there is
> retransmitted data, shouldn't that be cleared?

This line is duplicated, please see
tcp_done_with_error()->tcp_write_queue_purge().

Thanks,
Jason
Xueming Feng Aug. 14, 2024, 8:46 a.m. UTC | #5
On Mon, Aug 14, 2024 at 7:34 AM Lorenzo Colitti <lorenzo@google.com> wrote:
> On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote:
> > The -ENOENT code comes from the associate patch Lorenzo made for
> > iproute2-ss; link attached below.
> 
> ENOENT does seem reasonable. It's the same thing that would happen if
> userspace passed in a nonexistent cookie (we have a test for that).

In the latest TCP RFC 9293, section 3.10.5 on the ABORT CALL, it mentions
that an "error: connection does not exist" to be returned for a CLOSED 
STATE. I noticed this while verifying whether a reset in the FIN-WAIT 
STATE is legal, which it is.

> I'd guess this could happen if userspace was trying to destroy a
> socket but it lost the race against the process owning a socket
> closing it?

Yes, that’s exactly the scenario I'm addressing. I tested this locally
by calling tcp_diag twice with the same socket pointer.

> 
> >        bh_unlock_sock(sk);
> >        local_bh_enable();
> > -       tcp_write_queue_purge(sk);
> 
> Is this not necessary in any other cases? What if there is
> retransmitted data, shouldn't that be cleared?

The tcp_write_queue_purge() function is indeed invoked within 
tcp_done_with_error(). In this patch, the tcp_done_with_error is elevated
to the same logical level where tcp_write_queue_purge would typically be 
called. The difference is that the purge happens just before tcp_done.
So the queue should still be cleared in other scenarios as well.
diff mbox series

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03a342c9162..831a18dc7aa6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4637,6 +4637,13 @@  int tcp_abort(struct sock *sk, int err)
 		/* Don't race with userspace socket closes such as tcp_close. */
 		lock_sock(sk);
 
+	/* Avoid closing the same socket twice. */
+	if (sk->sk_state == TCP_CLOSE) {
+		if (!has_current_bpf_ctx())
+			release_sock(sk);
+		return -ENOENT;
+	}
+
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
 		inet_csk_listen_stop(sk);
@@ -4646,16 +4653,13 @@  int tcp_abort(struct sock *sk, int err)
 	local_bh_disable();
 	bh_lock_sock(sk);
 
-	if (!sock_flag(sk, SOCK_DEAD)) {
-		if (tcp_need_reset(sk->sk_state))
-			tcp_send_active_reset(sk, GFP_ATOMIC,
-					      SK_RST_REASON_NOT_SPECIFIED);
-		tcp_done_with_error(sk, err);
-	}
+	if (tcp_need_reset(sk->sk_state))
+		tcp_send_active_reset(sk, GFP_ATOMIC,
+				      SK_RST_REASON_NOT_SPECIFIED);
+	tcp_done_with_error(sk, err);
 
 	bh_unlock_sock(sk);
 	local_bh_enable();
-	tcp_write_queue_purge(sk);
 	if (!has_current_bpf_ctx())
 		release_sock(sk);
 	return 0;