Message ID | 20240801111611.84743-1-kuro@kuroa.me (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] tcp: fix forever orphan socket caused by tcp_abort | expand |
On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@kuroa.me> wrote: > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > environment. This patch come from the investigation. > > Previously tcp_abort only sends out reset and calls tcp_done when the > socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only > purging the write queue, but not close the socket and left it to the > timer. > > While purging the write queue, tp->packets_out and sk->sk_write_queue > is cleared along the way. However tcp_retransmit_timer have early > return based on !tp->packets_out and tcp_probe_timer have early > return based on !sk->sk_write_queue. > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > and socket not being killed by the timers. Converting a zero-windowed > orphan to a forever orphan. > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > reset to peer and close the socket accordingly. Preventing the > timer-less orphan from happening. > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") > Signed-off-by: Xueming Feng <kuro@kuroa.me> This seems legit, but are you sure these two blamed commits added this bug ? Even before them, we should have called tcp_done() right away, instead of waiting for a (possibly long) timer to complete the job. This might be important when killing millions of sockets on a busy server. CC Lorenzo Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? Thanks.
Hello Eric, On Thu, Aug 1, 2024 at 9:17 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@kuroa.me> wrote: > > > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > > environment. This patch come from the investigation. > > > > Previously tcp_abort only sends out reset and calls tcp_done when the > > socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only > > purging the write queue, but not close the socket and left it to the > > timer. > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue > > is cleared along the way. However tcp_retransmit_timer have early > > return based on !tp->packets_out and tcp_probe_timer have early > > return based on !sk->sk_write_queue. > > > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > > and socket not being killed by the timers. Converting a zero-windowed > > orphan to a forever orphan. > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > reset to peer and close the socket accordingly. Preventing the > > timer-less orphan from happening. > > > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") > > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") > > Signed-off-by: Xueming Feng <kuro@kuroa.me> > > This seems legit, but are you sure these two blamed commits added this bug ? > > Even before them, we should have called tcp_done() right away, instead > of waiting for a (possibly long) timer to complete the job. > > This might be important when killing millions of sockets on a busy server. > > CC Lorenzo > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? I guess that one of possible reasons is to avoid double-free, something like this, happening in inet_csk_destroy_sock(). Let me assume: if we call tcp_close() first under the memory pressure which means tcp_check_oom() returns true and then it will call inet_csk_destroy_sock() in __tcp_close(), later tcp_abort() will call tcp_done() to free the sk again in the inet_csk_destroy_sock() when not testing the SOCK_DEAD flag in tcp_abort. Do you think the above case could happen? Thanks, Jason
On Sat, Aug 3, 2024 at 11:48 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > Hello Eric, > > On Thu, Aug 1, 2024 at 9:17 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@kuroa.me> wrote: > > > > > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > > > environment. This patch come from the investigation. > > > > > > Previously tcp_abort only sends out reset and calls tcp_done when the > > > socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only > > > purging the write queue, but not close the socket and left it to the > > > timer. > > > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue > > > is cleared along the way. However tcp_retransmit_timer have early > > > return based on !tp->packets_out and tcp_probe_timer have early > > > return based on !sk->sk_write_queue. > > > > > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > > > and socket not being killed by the timers. Converting a zero-windowed > > > orphan to a forever orphan. > > > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > > reset to peer and close the socket accordingly. Preventing the > > > timer-less orphan from happening. > > > > > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") > > > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") > > > Signed-off-by: Xueming Feng <kuro@kuroa.me> > > > > This seems legit, but are you sure these two blamed commits added this bug ? > > > > Even before them, we should have called tcp_done() right away, instead > > of waiting for a (possibly long) timer to complete the job. > > > > This might be important when killing millions of sockets on a busy server. > > > > CC Lorenzo > > > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? > > I guess that one of possible reasons is to avoid double-free, > something like this, happening in inet_csk_destroy_sock(). > > Let me assume: if we call tcp_close() first under the memory pressure > which means tcp_check_oom() returns true and then it will call > inet_csk_destroy_sock() in __tcp_close(), later tcp_abort() will call > tcp_done() to free the sk again in the inet_csk_destroy_sock() when > not testing the SOCK_DEAD flag in tcp_abort. > How about this one which can prevent double calling inet_csk_destroy_sock() when we call destroy and close nearly at the same time under that circumstance: diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e03a342c9162..d5d3b21cc824 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4646,7 +4646,7 @@ int tcp_abort(struct sock *sk, int err) local_bh_disable(); bh_lock_sock(sk); - if (!sock_flag(sk, SOCK_DEAD)) { + if (sk->sk_state != TCP_CLOSE) { if (tcp_need_reset(sk->sk_state)) tcp_send_active_reset(sk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED); Each time we call inet_csk_destroy_sock(), we must make sure we've already set the state to TCP_CLOSE. Based on this, I think we can use this as an indicator to avoid calling twice to destroy the socket. Thanks, Jason
On Thu, Aug 1, 2024 at 10:11 PM Eric Dumazet <edumazet@google.com> wrote: > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > reset to peer and close the socket accordingly. Preventing the > > timer-less orphan from happening. > > [...] > > This seems legit, but are you sure these two blamed commits added this bug ? > > Even before them, we should have called tcp_done() right away, instead > of waiting for a (possibly long) timer to complete the job. > > This might be important when killing millions of sockets on a busy server. > > CC Lorenzo > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? I think I took it from the original tcp_nuke_addr implementation that Android used before SOCK_DESTROY and tcp_abort were written. The oldest reference I could find to that code is this commit that went into 2.6.39 (!), which already had that check. https://android.googlesource.com/kernel/common/+/06611218f86dc353d5dd0cb5acac32a0863a2ae5 I expect the check was intended to prevent force-closing the same socket twice.
On Mon, Aug 5, 2024 at 6:52 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Sat, Aug 3, 2024 at 11:48 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > Hello Eric, > > > > On Thu, Aug 1, 2024 at 9:17 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@kuroa.me> wrote: > > > > > > > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > > > > environment. This patch come from the investigation. > > > > > > > > Previously tcp_abort only sends out reset and calls tcp_done when the > > > > socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only > > > > purging the write queue, but not close the socket and left it to the > > > > timer. > > > > > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue > > > > is cleared along the way. However tcp_retransmit_timer have early > > > > return based on !tp->packets_out and tcp_probe_timer have early > > > > return based on !sk->sk_write_queue. > > > > > > > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > > > > and socket not being killed by the timers. Converting a zero-windowed > > > > orphan to a forever orphan. > > > > > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > > > reset to peer and close the socket accordingly. Preventing the > > > > timer-less orphan from happening. > > > > > > > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") > > > > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") > > > > Signed-off-by: Xueming Feng <kuro@kuroa.me> > > > > > > This seems legit, but are you sure these two blamed commits added this bug ? > > > > > > Even before them, we should have called tcp_done() right away, instead > > > of waiting for a (possibly long) timer to complete the job. > > > > > > This might be important when killing millions of sockets on a busy server. > > > > > > CC Lorenzo > > > > > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? > > > > I guess that one of possible reasons is to avoid double-free, > > something like this, happening in inet_csk_destroy_sock(). > > > > Let me assume: if we call tcp_close() first under the memory pressure > > which means tcp_check_oom() returns true and then it will call > > inet_csk_destroy_sock() in __tcp_close(), later tcp_abort() will call > > tcp_done() to free the sk again in the inet_csk_destroy_sock() when > > not testing the SOCK_DEAD flag in tcp_abort. > > > > How about this one which can prevent double calling > inet_csk_destroy_sock() when we call destroy and close nearly at the > same time under that circumstance: > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index e03a342c9162..d5d3b21cc824 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -4646,7 +4646,7 @@ int tcp_abort(struct sock *sk, int err) > local_bh_disable(); > bh_lock_sock(sk); > > - if (!sock_flag(sk, SOCK_DEAD)) { > + if (sk->sk_state != TCP_CLOSE) { > if (tcp_need_reset(sk->sk_state)) > tcp_send_active_reset(sk, GFP_ATOMIC, > SK_RST_REASON_NOT_SPECIFIED); > > Each time we call inet_csk_destroy_sock(), we must make sure we've > already set the state to TCP_CLOSE. Based on this, I think we can use > this as an indicator to avoid calling twice to destroy the socket. I do not think this will work. With this patch, a listener socket will not get an error notification. Ideally we need tests for this seldom used feature.
On Mon, Aug 5, 2024 at 2:43 PM Lorenzo Colitti <lorenzo@google.com> wrote: > > On Thu, Aug 1, 2024 at 10:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > > reset to peer and close the socket accordingly. Preventing the > > > timer-less orphan from happening. > > > [...] > > > > This seems legit, but are you sure these two blamed commits added this bug ? > > > > Even before them, we should have called tcp_done() right away, instead > > of waiting for a (possibly long) timer to complete the job. > > > > This might be important when killing millions of sockets on a busy server. > > > > CC Lorenzo > > > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? > > I think I took it from the original tcp_nuke_addr implementation that > Android used before SOCK_DESTROY and tcp_abort were written. The > oldest reference I could find to that code is this commit that went > into 2.6.39 (!), which already had that check. > > https://android.googlesource.com/kernel/common/+/06611218f86dc353d5dd0cb5acac32a0863a2ae5 > > I expect the check was intended to prevent force-closing the same socket twice. > Yes, I guess so.
On Mon, Aug 5, 2024 at 3:23 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Aug 5, 2024 at 6:52 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Sat, Aug 3, 2024 at 11:48 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > Hello Eric, > > > > > > On Thu, Aug 1, 2024 at 9:17 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@kuroa.me> wrote: > > > > > > > > > > We have some problem closing zero-window fin-wait-1 tcp sockets in our > > > > > environment. This patch come from the investigation. > > > > > > > > > > Previously tcp_abort only sends out reset and calls tcp_done when the > > > > > socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only > > > > > purging the write queue, but not close the socket and left it to the > > > > > timer. > > > > > > > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue > > > > > is cleared along the way. However tcp_retransmit_timer have early > > > > > return based on !tp->packets_out and tcp_probe_timer have early > > > > > return based on !sk->sk_write_queue. > > > > > > > > > > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched > > > > > and socket not being killed by the timers. Converting a zero-windowed > > > > > orphan to a forever orphan. > > > > > > > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send > > > > > reset to peer and close the socket accordingly. Preventing the > > > > > timer-less orphan from happening. > > > > > > > > > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") > > > > > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") > > > > > Signed-off-by: Xueming Feng <kuro@kuroa.me> > > > > > > > > This seems legit, but are you sure these two blamed commits added this bug ? > > > > > > > > Even before them, we should have called tcp_done() right away, instead > > > > of waiting for a (possibly long) timer to complete the job. > > > > > > > > This might be important when killing millions of sockets on a busy server. > > > > > > > > CC Lorenzo > > > > > > > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ? > > > > > > I guess that one of possible reasons is to avoid double-free, > > > something like this, happening in inet_csk_destroy_sock(). > > > > > > Let me assume: if we call tcp_close() first under the memory pressure > > > which means tcp_check_oom() returns true and then it will call > > > inet_csk_destroy_sock() in __tcp_close(), later tcp_abort() will call > > > tcp_done() to free the sk again in the inet_csk_destroy_sock() when > > > not testing the SOCK_DEAD flag in tcp_abort. > > > > > > > How about this one which can prevent double calling > > inet_csk_destroy_sock() when we call destroy and close nearly at the > > same time under that circumstance: > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index e03a342c9162..d5d3b21cc824 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -4646,7 +4646,7 @@ int tcp_abort(struct sock *sk, int err) > > local_bh_disable(); > > bh_lock_sock(sk); > > > > - if (!sock_flag(sk, SOCK_DEAD)) { > > + if (sk->sk_state != TCP_CLOSE) { > > if (tcp_need_reset(sk->sk_state)) > > tcp_send_active_reset(sk, GFP_ATOMIC, > > SK_RST_REASON_NOT_SPECIFIED); > > > > Each time we call inet_csk_destroy_sock(), we must make sure we've > > already set the state to TCP_CLOSE. Based on this, I think we can use > > this as an indicator to avoid calling twice to destroy the socket. > > I do not think this will work. > > With this patch, a listener socket will not get an error notification. Oh, you're right. I think we can add this particular case in the if or if-else statement to handle. Thanks, Jason
On Mon, Aug 5, 2024 at 4:23 PM Eric Dumazet <edumazet@google.com> wrote: > > Each time we call inet_csk_destroy_sock(), we must make sure we've > > already set the state to TCP_CLOSE. Based on this, I think we can use > > this as an indicator to avoid calling twice to destroy the socket. > > I do not think this will work. > > With this patch, a listener socket will not get an error notification. > > Ideally we need tests for this seldom used feature. FWIW there is a fair amount of test coverage here: https://cs.android.com/android/platform/superproject/main/+/main:kernel/tests/net/test/sock_diag_test.py though unfortunately they don't pass on unmodified kernels (I didn't look into why - maybe Maciej knows). I ran the tests on the "v2-ish patch" and they all passed except for a test that expects that SOCK_DESTROY on a FIN_WAIT1 socket does nothing. That seems OK because it's the thing your patch is trying to fix. Just to confirm - it's OK to send a RST on a connection that's already in FIN_WAIT1 state? Is that allowed by the RFC?
On Wed, Aug 14, 2024 at 12:43 PM Lorenzo Colitti <lorenzo@google.com> wrote: > > On Mon, Aug 5, 2024 at 4:23 PM Eric Dumazet <edumazet@google.com> wrote: > > > Each time we call inet_csk_destroy_sock(), we must make sure we've > > > already set the state to TCP_CLOSE. Based on this, I think we can use > > > this as an indicator to avoid calling twice to destroy the socket. > > > > I do not think this will work. > > > > With this patch, a listener socket will not get an error notification. > > > > Ideally we need tests for this seldom used feature. > > FWIW there is a fair amount of test coverage here: > > https://cs.android.com/android/platform/superproject/main/+/main:kernel/tests/net/test/sock_diag_test.py > > though unfortunately they don't pass on unmodified kernels (I didn't > look into why - maybe Maciej knows). I ran the tests on the "v2-ish > patch" and they all passed except for a test that expects that > SOCK_DESTROY on a FIN_WAIT1 socket does nothing. That seems OK because > it's the thing your patch is trying to fix. > > Just to confirm - it's OK to send a RST on a connection that's already > in FIN_WAIT1 state? Is that allowed by the RFC? I think so. Please take a look at the following link which tells us whether we should send an RST: ABORT Call ESTABLISHED STATE FIN-WAIT-1 STATE FIN-WAIT-2 STATE CLOSE-WAIT STATE Send a reset segment: <SEQ=SND.NXT><CTL=RST> All queued SENDs and RECEIVEs should be given "connection reset" notification; all segments queued for transmission (except for the RST formed above) or retransmission should be flushed, delete the TCB, enter CLOSED state, and return. https://www.ietf.org/rfc/rfc793.txt#:~:text=Specification%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20ABORT%20Call-,ABORT%20Call,-CLOSED%20STATE%20(i Thanks, Jason
On Wed, Aug 14, 2024 at 1:43 PM Lorenzo Colitti <lorenzo@google.com> wrote: > though unfortunately they don't pass on unmodified kernels (I didn't > look into why - maybe Maciej knows). Actually, they do: just git clone https://android.googlesource.com/kernel/tests and from the kernel tree do path/to/net/test/run_net_test.sh sock_diag_test.py
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e03a342c9162..65e8d28d15b1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4646,12 +4646,10 @@ int tcp_abort(struct sock *sk, int err) local_bh_disable(); bh_lock_sock(sk); - if (!sock_flag(sk, SOCK_DEAD)) { - if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); - tcp_done_with_error(sk, err); - } + if (tcp_need_reset(sk->sk_state)) + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); + tcp_done_with_error(sk, err); bh_unlock_sock(sk); local_bh_enable();
We have some problem closing zero-window fin-wait-1 tcp sockets in our environment. This patch come from the investigation. Previously tcp_abort only sends out reset and calls tcp_done when the socket is not SOCK_DEAD aka. orphan. For orphan socket, it will only purging the write queue, but not close the socket and left it to the timer. While purging the write queue, tp->packets_out and sk->sk_write_queue is cleared along the way. However tcp_retransmit_timer have early return based on !tp->packets_out and tcp_probe_timer have early return based on !sk->sk_write_queue. This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched and socket not being killed by the timers. Converting a zero-windowed orphan to a forever orphan. This patch removes the SOCK_DEAD check in tcp_abort, making it send reset to peer and close the socket accordingly. Preventing the timer-less orphan from happening. Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection") Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue") Signed-off-by: Xueming Feng <kuro@kuroa.me> --- net/ipv4/tcp.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)