diff mbox series

[net,v3] tcp: correct handling of extreme memory squeeze

Message ID 20250127231304.1465565-1-jmaloy@redhat.com (mailing list archive)
State Accepted
Commit 8c670bdfa58e48abad1d5b6ca1ee843ca91f7303
Delegated to: Netdev Maintainers
Headers show
Series [net,v3] tcp: correct handling of extreme memory squeeze | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: imagedong@tencent.com; 4 maintainers not CCed: dsahern@kernel.org imagedong@tencent.com horms@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-01-29--12-00 (tests: 885)

Commit Message

Jon Maloy Jan. 27, 2025, 11:13 p.m. UTC
From: Jon Maloy <jmaloy@redhat.com>

Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
                       [OFO queue: gap: 65480, len: 0]
[5201<->39222]:     tcp_data_queue(<-)
[5201<->39222]:     __tcp_transmit_skb(->)
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:       tcp_select_window(->)
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
                        returning 0
[5201<->39222]:       tcp_select_window(<-)
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     [__tcp_transmit_skb(<-)
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
                  returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
                  returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
                  returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024:
     https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r
v2: -Improved commit log to clarify how we end up in this situation.
    -After feedback from Eric Dumazet, removed references to use of
     SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
     about how this situation occurs. Those flags are used at the
     peer side's incoming connection, and not on this one.
v3: -Clearing tp->pred_flags when we do a zero advertisement.
    -Edited (manually) and shortened the log documenting the problem.
---
 net/ipv4/tcp_output.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Jason Xing Jan. 28, 2025, 12:52 a.m. UTC | #1
On Tue, Jan 28, 2025 at 7:13 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>                        [OFO queue: gap: 65480, len: 0]
> [5201<->39222]:     tcp_data_queue(<-)
> [5201<->39222]:     __tcp_transmit_skb(->)
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:       tcp_select_window(->)
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>                         returning 0
> [5201<->39222]:       tcp_select_window(<-)
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     [__tcp_transmit_skb(<-)
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>                   returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>                   returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>                   returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by updating the socket state correctly when a packet has
> been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Cc: Menglong Dong <menglong8.dong@gmail.com>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

I reckon that adding more description about why this case can be
triggered (the reason behind this case is the other side using other
kernels doesn't periodically send a window probe) is really necessary.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason
Eric Dumazet Jan. 28, 2025, 3:04 p.m. UTC | #2
On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>                        [OFO queue: gap: 65480, len: 0]
> [5201<->39222]:     tcp_data_queue(<-)
> [5201<->39222]:     __tcp_transmit_skb(->)
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:       tcp_select_window(->)
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>                         returning 0
> [5201<->39222]:       tcp_select_window(<-)
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     [__tcp_transmit_skb(<-)
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>                   returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>                   returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>                   returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.

This so-called 'deadlock' only occurs if a remote TCP stack is unable
to send win0 probes.

In this case, sending some ACK will not help reliably if these ACK get lost.

I find the description tries very hard to hide a bug in another stack,
for some reason.

When under memory stress, not sending an opening ACK as fast as possible,
giving time for the host to recover from this memory stress was also a
sensible thing to do.

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks for the fix.
Stefano Brivio Jan. 28, 2025, 3:18 p.m. UTC | #3
On Tue, 28 Jan 2025 16:04:35 +0100
Eric Dumazet <edumazet@google.com> wrote:

> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
> 
> In this case, sending some ACK will not help reliably if these ACK get lost.
> 
> I find the description tries very hard to hide a bug in another stack,
> for some reason.

Side note: that was fixed meanwhile. Back then, at a first analysis, we
thought it was a workaround, but it's the actual fix as we *must* send
zero-window probes in that situation, and this commit triggers them:

  https://passt.top/passt/commit/?id=a740e16fd1b9bdca8d259aa6d37f942a3874425c
Neal Cardwell Jan. 28, 2025, 3:56 p.m. UTC | #4
On Tue, Jan 28, 2025 at 10:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing with iperf3 using the "pasta" protocol splicer has revealed
> > a bug in the way tcp handles window advertising in extreme memory
> > squeeze situations.
> >
> > Under memory pressure, a socket endpoint may temporarily advertise
> > a zero-sized window, but this is not stored as part of the socket data.
> > The reasoning behind this is that it is considered a temporary setting
> > which shouldn't influence any further calculations.
> >
> > However, if we happen to stall at an unfortunate value of the current
> > window size, the algorithm selecting a new value will consistently fail
> > to advertise a non-zero window once we have freed up enough memory.
> > This means that this side's notion of the current window size is
> > different from the one last advertised to the peer, causing the latter
> > to not send any data to resolve the sitution.
> >
> > The problem occurs on the iperf3 server side, and the socket in question
> > is a completely regular socket with the default settings for the
> > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >
> > The following excerpt of a logging session, with own comments added,
> > shows more in detail what is happening:
> >
> > //              tcp_v4_rcv(->)
> > //                tcp_rcv_established(->)
> > [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> > [5201<->39222]:     tcp_data_queue(->)
> > [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> >                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> >                        [OFO queue: gap: 65480, len: 0]
> > [5201<->39222]:     tcp_data_queue(<-)
> > [5201<->39222]:     __tcp_transmit_skb(->)
> >                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:       tcp_select_window(->)
> > [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
> >                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> >                         returning 0
> > [5201<->39222]:       tcp_select_window(<-)
> > [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> > [5201<->39222]:     [__tcp_transmit_skb(<-)
> > [5201<->39222]:   tcp_rcv_established(<-)
> > [5201<->39222]: tcp_v4_rcv(<-)
> >
> > // Receive queue is at 85 buffers and we are out of memory.
> > // We drop the incoming buffer, although it is in sequence, and decide
> > // to send an advertisement with a window of zero.
> > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> > // we unconditionally shrink the window.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
> >                   returning 6104 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // After each read, the algorithm for calculating the new receive
> > // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> > // or to update tp->rcv_wnd.
> > // Meanwhile, the peer thinks the window is zero, and will not send
> > // any more data to trigger an update from the interrupt mode side.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
> >                   returning 131072 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The above pattern repeats again and again, since nothing changes
> > // between the reads.
> >
> > [...]
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
> >                   returning 54672 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The receive queue is empty, but no new advertisement has been sent.
> > // The peer still thinks the receive window is zero, and sends nothing.
> > // We have ended up in a deadlock situation.
>
> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
>
> In this case, sending some ACK will not help reliably if these ACK get lost.
>
> I find the description tries very hard to hide a bug in another stack,
> for some reason.
>
> When under memory stress, not sending an opening ACK as fast as possible,
> giving time for the host to recover from this memory stress was also a
> sensible thing to do.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> Thanks for the fix.

Yes, thanks for the fix. LGTM as well.

Reviewed-by: Neal Cardwell <ncardwell@google.com>

BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an
SNMP stat for this case, since we have SNMP stat increases for other
0-window cases. That could help debugging performance problems from
memory pressure and zero windows. But that can be in a separate patch
for net-next once this fix is in net-next.

neal
Eric Dumazet Jan. 28, 2025, 4:01 p.m. UTC | #5
On Tue, Jan 28, 2025 at 4:57 PM Neal Cardwell <ncardwell@google.com> wrote:

>
> Yes, thanks for the fix. LGTM as well.
>
> Reviewed-by: Neal Cardwell <ncardwell@google.com>
>
> BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an
> SNMP stat for this case, since we have SNMP stat increases for other
> 0-window cases. That could help debugging performance problems from
> memory pressure and zero windows. But that can be in a separate patch
> for net-next once this fix is in net-next.

This is LINUX_MIB_TCPRCVQDROP  ( TCPRcvQDrop )
Jon Maloy Jan. 28, 2025, 4:51 p.m. UTC | #6
On 2025-01-28 10:04, Eric Dumazet wrote:
> On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the sitution.
>>
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> //              tcp_v4_rcv(->)
>> //                tcp_rcv_established(->)
>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]:     tcp_data_queue(->)
>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                         [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>>                         [OFO queue: gap: 65480, len: 0]
>> [5201<->39222]:     tcp_data_queue(<-)
>> [5201<->39222]:     __tcp_transmit_skb(->)
>>                          [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:       tcp_select_window(->)
>> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>>                          [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>>                          returning 0
>> [5201<->39222]:       tcp_select_window(<-)
>> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
>> [5201<->39222]:     [__tcp_transmit_skb(<-)
>> [5201<->39222]:   tcp_rcv_established(<-)
>> [5201<->39222]: tcp_v4_rcv(<-)
>>
>> // Receive queue is at 85 buffers and we are out of memory.
>> // We drop the incoming buffer, although it is in sequence, and decide
>> // to send an advertisement with a window of zero.
>> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
>> // we unconditionally shrink the window.
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>>                    returning 6104 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // After each read, the algorithm for calculating the new receive
>> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
>> // or to update tp->rcv_wnd.
>> // Meanwhile, the peer thinks the window is zero, and will not send
>> // any more data to trigger an update from the interrupt mode side.
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>>                    returning 131072 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // The above pattern repeats again and again, since nothing changes
>> // between the reads.
>>
>> [...]
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>>                    returning 54672 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // The receive queue is empty, but no new advertisement has been sent.
>> // The peer still thinks the receive window is zero, and sends nothing.
>> // We have ended up in a deadlock situation.
> 
> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
> 
> In this case, sending some ACK will not help reliably if these ACK get lost.
> 
> I find the description tries very hard to hide a bug in another stack,
> for some reason.

I clearly stated in a previous comment that this was the case, and that
it has been fixed now. My reason for posting this is because I still
think this is a bug, just as I think the way we use rcv_ssthresh in 
_tcp_select)window() is a bug that eventually should be fixed.

> 
> When under memory stress, not sending an opening ACK as fast as possible,
> giving time for the host to recover from this memory stress was also a
> sensible thing to do.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> 
> Thanks for the fix.
>
Eric Dumazet Jan. 28, 2025, 5:11 p.m. UTC | #7
On Tue, Jan 28, 2025 at 5:51 PM Jon Maloy <jmaloy@redhat.com> wrote:

> I clearly stated in a previous comment that this was the case, and that
> it has been fixed now. My reason for posting this is because I still
> think this is a bug, just as I think the way we use rcv_ssthresh in
> _tcp_select)window() is a bug that eventually should be fixed.

I was referring to a wrong statement in the changelog, claiming a
'deadlock situation' ...

It is pretty clear there is no deadlock here, unless the remote TCP
stack is _absolutely_ _broken_.

If you still want to capture this in an official changelog, it would
be nice to clarify this,
to avoid yet another CVE to be filled based on scary sentences
misleading many teams
in the world.

Keep changelogs accurate and factual, so that we can find useful
signals in them.

All your __tcp_cleanup_rbuf() repetitions are simply noise. It does not matter
if it is called once or ten times.
patchwork-bot+netdevbpf@kernel.org Jan. 30, 2025, 3:10 a.m. UTC | #8
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 27 Jan 2025 18:13:04 -0500 you wrote:
> From: Jon Maloy <jmaloy@redhat.com>
> 
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
> 
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
> 
> [...]

Here is the summary with links:
  - [net,v3] tcp: correct handling of extreme memory squeeze
    https://git.kernel.org/netdev/net/c/8c670bdfa58e

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..bc95d2a5924f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,14 @@  static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->pred_flags = 0;
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);