diff mbox series

[net,v2] tcp: correct handling of extreme memory squeeze

Message ID 20250117214035.2414668-1-jmaloy@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] tcp: correct handling of extreme memory squeeze | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 3 maintainers not CCed: pabeni@redhat.com horms@kernel.org dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 16 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Jon Maloy Jan. 17, 2025, 9:40 p.m. UTC
From: Jon Maloy <jmaloy@redhat.com>

Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]:     __tcp_transmit_skb(->)
[5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
has been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024
v2: -Improved commit log to clarify how we end up in this situation.
    -After feedback from Eric Dumazet, removed references to use of
     SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
     about how this situation occurs. Those flags are used at the
     peer side's incoming connection, and not on this one.
---
 net/ipv4/tcp_output.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Jan. 17, 2025, 10:09 p.m. UTC | #1
On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]:     __tcp_transmit_skb(->)
> [5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> v1: -Posted on Apr 6, 2024

Could you post the link, this was a long time ago and I forgot the context.

> v2: -Improved commit log to clarify how we end up in this situation.
>     -After feedback from Eric Dumazet, removed references to use of
>      SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
>      about how this situation occurs. Those flags are used at the
>      peer side's incoming connection, and not on this one.
> ---
>  net/ipv4/tcp_output.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..ba295f798e5e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
>         u32 cur_win, new_win;
>
>         /* Make the window 0 if we failed to queue the data because we
> -        * are out of memory. The window is temporary, so we don't store
> -        * it on the socket.
> +        * are out of memory.
>          */
> -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> +               tp->rcv_wnd = 0;
> +               tp->rcv_wup = tp->rcv_nxt;

I wonder if we should not clear tp->pred_flags here ?

Also, any chance you could provide a packetdrill test ?

Your changelog contains traces that are hard to follow.

Thanks.
Stefano Brivio Jan. 17, 2025, 10:27 p.m. UTC | #2
[Fixed Cc: for Menglong Dong, this is a reply to:
 https://lore.kernel.org/all/CANn89i+Ks52JVTBsMFQBM4CqUR4cegXhbSCH77aMCqFpd-S_1A@mail.gmail.com/]

On Fri, 17 Jan 2025 23:09:27 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> > v1: -Posted on Apr 6, 2024  
> 
> Could you post the link, this was a long time ago and I forgot the context.

https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r
diff mbox series

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@  static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);