Message ID | 20250127231304.1465565-1-jmaloy@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 8c670bdfa58e48abad1d5b6ca1ee843ca91f7303 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,v3] tcp: correct handling of extreme memory squeeze | expand |
On Tue, Jan 28, 2025 at 7:13 AM <jmaloy@redhat.com> wrote: > > From: Jon Maloy <jmaloy@redhat.com> > > Testing with iperf3 using the "pasta" protocol splicer has revealed > a bug in the way tcp handles window advertising in extreme memory > squeeze situations. > > Under memory pressure, a socket endpoint may temporarily advertise > a zero-sized window, but this is not stored as part of the socket data. > The reasoning behind this is that it is considered a temporary setting > which shouldn't influence any further calculations. > > However, if we happen to stall at an unfortunate value of the current > window size, the algorithm selecting a new value will consistently fail > to advertise a non-zero window once we have freed up enough memory. > This means that this side's notion of the current window size is > different from the one last advertised to the peer, causing the latter > to not send any data to resolve the sitution. > > The problem occurs on the iperf3 server side, and the socket in question > is a completely regular socket with the default settings for the > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. > > The following excerpt of a logging session, with own comments added, > shows more in detail what is happening: > > // tcp_v4_rcv(->) > // tcp_rcv_established(->) > [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== > [5201<->39222]: tcp_data_queue(->) > [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] > [OFO queue: gap: 65480, len: 0] > [5201<->39222]: tcp_data_queue(<-) > [5201<->39222]: __tcp_transmit_skb(->) > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: tcp_select_window(->) > [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > returning 0 > [5201<->39222]: tcp_select_window(<-) > [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 > [5201<->39222]: [__tcp_transmit_skb(<-) > [5201<->39222]: tcp_rcv_established(<-) > [5201<->39222]: tcp_v4_rcv(<-) > > // Receive queue is at 85 buffers and we are out of memory. > // We drop the incoming buffer, although it is in sequence, and decide > // to send an advertisement with a window of zero. > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means > // we unconditionally shrink the window. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] > returning 6104 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // After each read, the algorithm for calculating the new receive > // window in __tcp_cleanup_rbuf() finds it is too small to advertise > // or to update tp->rcv_wnd. > // Meanwhile, the peer thinks the window is zero, and will not send > // any more data to trigger an update from the interrupt mode side. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] > returning 131072 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // The above pattern repeats again and again, since nothing changes > // between the reads. > > [...] > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] > returning 54672 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // The receive queue is empty, but no new advertisement has been sent. > // The peer still thinks the receive window is zero, and sends nothing. > // We have ended up in a deadlock situation. > > Furthermore, we have observed that in these situations this side may > send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup > as it should be. Backing ack_seq seems to be harmless, but is of > course still wrong from a protocol viewpoint. > > We fix this by updating the socket state correctly when a packet has > been dropped because of memory exhaustion and we have to advertize > a zero window. > > Further testing shows that the connection recovers neatly from the > squeeze situation, and traffic can continue indefinitely. > > Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory") > Cc: Menglong Dong <menglong8.dong@gmail.com> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com> > Signed-off-by: Jon Maloy <jmaloy@redhat.com> I reckon that adding more description about why this case can be triggered (the reason behind this case is the other side using other kernels doesn't periodically send a window probe) is really necessary. Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Thanks, Jason
On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote: > > From: Jon Maloy <jmaloy@redhat.com> > > Testing with iperf3 using the "pasta" protocol splicer has revealed > a bug in the way tcp handles window advertising in extreme memory > squeeze situations. > > Under memory pressure, a socket endpoint may temporarily advertise > a zero-sized window, but this is not stored as part of the socket data. > The reasoning behind this is that it is considered a temporary setting > which shouldn't influence any further calculations. > > However, if we happen to stall at an unfortunate value of the current > window size, the algorithm selecting a new value will consistently fail > to advertise a non-zero window once we have freed up enough memory. > This means that this side's notion of the current window size is > different from the one last advertised to the peer, causing the latter > to not send any data to resolve the sitution. > > The problem occurs on the iperf3 server side, and the socket in question > is a completely regular socket with the default settings for the > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. > > The following excerpt of a logging session, with own comments added, > shows more in detail what is happening: > > // tcp_v4_rcv(->) > // tcp_rcv_established(->) > [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== > [5201<->39222]: tcp_data_queue(->) > [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] > [OFO queue: gap: 65480, len: 0] > [5201<->39222]: tcp_data_queue(<-) > [5201<->39222]: __tcp_transmit_skb(->) > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: tcp_select_window(->) > [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > returning 0 > [5201<->39222]: tcp_select_window(<-) > [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 > [5201<->39222]: [__tcp_transmit_skb(<-) > [5201<->39222]: tcp_rcv_established(<-) > [5201<->39222]: tcp_v4_rcv(<-) > > // Receive queue is at 85 buffers and we are out of memory. > // We drop the incoming buffer, although it is in sequence, and decide > // to send an advertisement with a window of zero. > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means > // we unconditionally shrink the window. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] > returning 6104 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // After each read, the algorithm for calculating the new receive > // window in __tcp_cleanup_rbuf() finds it is too small to advertise > // or to update tp->rcv_wnd. > // Meanwhile, the peer thinks the window is zero, and will not send > // any more data to trigger an update from the interrupt mode side. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] > returning 131072 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // The above pattern repeats again and again, since nothing changes > // between the reads. > > [...] > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > [5201<->39222]: NOT calling tcp_send_ack() > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > [5201<->39222]: __tcp_cleanup_rbuf(<-) > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] > returning 54672 bytes > [5201<->39222]: tcp_recvmsg_locked(<-) > > // The receive queue is empty, but no new advertisement has been sent. > // The peer still thinks the receive window is zero, and sends nothing. > // We have ended up in a deadlock situation. This so-called 'deadlock' only occurs if a remote TCP stack is unable to send win0 probes. In this case, sending some ACK will not help reliably if these ACK get lost. I find the description tries very hard to hide a bug in another stack, for some reason. When under memory stress, not sending an opening ACK as fast as possible, giving time for the host to recover from this memory stress was also a sensible thing to do. Reviewed-by: Eric Dumazet <edumazet@google.com> Thanks for the fix.
On Tue, 28 Jan 2025 16:04:35 +0100 Eric Dumazet <edumazet@google.com> wrote: > This so-called 'deadlock' only occurs if a remote TCP stack is unable > to send win0 probes. > > In this case, sending some ACK will not help reliably if these ACK get lost. > > I find the description tries very hard to hide a bug in another stack, > for some reason. Side note: that was fixed meanwhile. Back then, at a first analysis, we thought it was a workaround, but it's the actual fix as we *must* send zero-window probes in that situation, and this commit triggers them: https://passt.top/passt/commit/?id=a740e16fd1b9bdca8d259aa6d37f942a3874425c
On Tue, Jan 28, 2025 at 10:04 AM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote: > > > > From: Jon Maloy <jmaloy@redhat.com> > > > > Testing with iperf3 using the "pasta" protocol splicer has revealed > > a bug in the way tcp handles window advertising in extreme memory > > squeeze situations. > > > > Under memory pressure, a socket endpoint may temporarily advertise > > a zero-sized window, but this is not stored as part of the socket data. > > The reasoning behind this is that it is considered a temporary setting > > which shouldn't influence any further calculations. > > > > However, if we happen to stall at an unfortunate value of the current > > window size, the algorithm selecting a new value will consistently fail > > to advertise a non-zero window once we have freed up enough memory. > > This means that this side's notion of the current window size is > > different from the one last advertised to the peer, causing the latter > > to not send any data to resolve the sitution. > > > > The problem occurs on the iperf3 server side, and the socket in question > > is a completely regular socket with the default settings for the > > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. > > > > The following excerpt of a logging session, with own comments added, > > shows more in detail what is happening: > > > > // tcp_v4_rcv(->) > > // tcp_rcv_established(->) > > [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== > > [5201<->39222]: tcp_data_queue(->) > > [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM > > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > > [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] > > [OFO queue: gap: 65480, len: 0] > > [5201<->39222]: tcp_data_queue(<-) > > [5201<->39222]: __tcp_transmit_skb(->) > > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > > [5201<->39222]: tcp_select_window(->) > > [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE > > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > > returning 0 > > [5201<->39222]: tcp_select_window(<-) > > [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 > > [5201<->39222]: [__tcp_transmit_skb(<-) > > [5201<->39222]: tcp_rcv_established(<-) > > [5201<->39222]: tcp_v4_rcv(<-) > > > > // Receive queue is at 85 buffers and we are out of memory. > > // We drop the incoming buffer, although it is in sequence, and decide > > // to send an advertisement with a window of zero. > > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means > > // we unconditionally shrink the window. > > > > [5201<->39222]: tcp_recvmsg_locked(->) > > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > > [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] > > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > > [5201<->39222]: NOT calling tcp_send_ack() > > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > > [5201<->39222]: __tcp_cleanup_rbuf(<-) > > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > > [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] > > returning 6104 bytes > > [5201<->39222]: tcp_recvmsg_locked(<-) > > > > // After each read, the algorithm for calculating the new receive > > // window in __tcp_cleanup_rbuf() finds it is too small to advertise > > // or to update tp->rcv_wnd. > > // Meanwhile, the peer thinks the window is zero, and will not send > > // any more data to trigger an update from the interrupt mode side. > > > > [5201<->39222]: tcp_recvmsg_locked(->) > > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > > [5201<->39222]: NOT calling tcp_send_ack() > > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > > [5201<->39222]: __tcp_cleanup_rbuf(<-) > > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > > [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] > > returning 131072 bytes > > [5201<->39222]: tcp_recvmsg_locked(<-) > > > > // The above pattern repeats again and again, since nothing changes > > // between the reads. > > > > [...] > > > > [5201<->39222]: tcp_recvmsg_locked(->) > > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 > > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] > > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] > > [5201<->39222]: NOT calling tcp_send_ack() > > [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] > > [5201<->39222]: __tcp_cleanup_rbuf(<-) > > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] > > [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] > > returning 54672 bytes > > [5201<->39222]: tcp_recvmsg_locked(<-) > > > > // The receive queue is empty, but no new advertisement has been sent. > > // The peer still thinks the receive window is zero, and sends nothing. > > // We have ended up in a deadlock situation. > > This so-called 'deadlock' only occurs if a remote TCP stack is unable > to send win0 probes. > > In this case, sending some ACK will not help reliably if these ACK get lost. > > I find the description tries very hard to hide a bug in another stack, > for some reason. > > When under memory stress, not sending an opening ACK as fast as possible, > giving time for the host to recover from this memory stress was also a > sensible thing to do. > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > Thanks for the fix. Yes, thanks for the fix. LGTM as well. Reviewed-by: Neal Cardwell <ncardwell@google.com> BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an SNMP stat for this case, since we have SNMP stat increases for other 0-window cases. That could help debugging performance problems from memory pressure and zero windows. But that can be in a separate patch for net-next once this fix is in net-next. neal
On Tue, Jan 28, 2025 at 4:57 PM Neal Cardwell <ncardwell@google.com> wrote: > > Yes, thanks for the fix. LGTM as well. > > Reviewed-by: Neal Cardwell <ncardwell@google.com> > > BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an > SNMP stat for this case, since we have SNMP stat increases for other > 0-window cases. That could help debugging performance problems from > memory pressure and zero windows. But that can be in a separate patch > for net-next once this fix is in net-next. This is LINUX_MIB_TCPRCVQDROP ( TCPRcvQDrop )
On 2025-01-28 10:04, Eric Dumazet wrote: > On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote: >> >> From: Jon Maloy <jmaloy@redhat.com> >> >> Testing with iperf3 using the "pasta" protocol splicer has revealed >> a bug in the way tcp handles window advertising in extreme memory >> squeeze situations. >> >> Under memory pressure, a socket endpoint may temporarily advertise >> a zero-sized window, but this is not stored as part of the socket data. >> The reasoning behind this is that it is considered a temporary setting >> which shouldn't influence any further calculations. >> >> However, if we happen to stall at an unfortunate value of the current >> window size, the algorithm selecting a new value will consistently fail >> to advertise a non-zero window once we have freed up enough memory. >> This means that this side's notion of the current window size is >> different from the one last advertised to the peer, causing the latter >> to not send any data to resolve the sitution. >> >> The problem occurs on the iperf3 server side, and the socket in question >> is a completely regular socket with the default settings for the >> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. >> >> The following excerpt of a logging session, with own comments added, >> shows more in detail what is happening: >> >> // tcp_v4_rcv(->) >> // tcp_rcv_established(->) >> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== >> [5201<->39222]: tcp_data_queue(->) >> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM >> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] >> [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] >> [OFO queue: gap: 65480, len: 0] >> [5201<->39222]: tcp_data_queue(<-) >> [5201<->39222]: __tcp_transmit_skb(->) >> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] >> [5201<->39222]: tcp_select_window(->) >> [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE >> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] >> returning 0 >> [5201<->39222]: tcp_select_window(<-) >> [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 >> [5201<->39222]: [__tcp_transmit_skb(<-) >> [5201<->39222]: tcp_rcv_established(<-) >> [5201<->39222]: tcp_v4_rcv(<-) >> >> // Receive queue is at 85 buffers and we are out of memory. >> // We drop the incoming buffer, although it is in sequence, and decide >> // to send an advertisement with a window of zero. >> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means >> // we unconditionally shrink the window. >> >> [5201<->39222]: tcp_recvmsg_locked(->) >> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 >> [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] >> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] >> [5201<->39222]: NOT calling tcp_send_ack() >> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] >> [5201<->39222]: __tcp_cleanup_rbuf(<-) >> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] >> [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] >> returning 6104 bytes >> [5201<->39222]: tcp_recvmsg_locked(<-) >> >> // After each read, the algorithm for calculating the new receive >> // window in __tcp_cleanup_rbuf() finds it is too small to advertise >> // or to update tp->rcv_wnd. >> // Meanwhile, the peer thinks the window is zero, and will not send >> // any more data to trigger an update from the interrupt mode side. >> >> [5201<->39222]: tcp_recvmsg_locked(->) >> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 >> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] >> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] >> [5201<->39222]: NOT calling tcp_send_ack() >> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] >> [5201<->39222]: __tcp_cleanup_rbuf(<-) >> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] >> [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] >> returning 131072 bytes >> [5201<->39222]: tcp_recvmsg_locked(<-) >> >> // The above pattern repeats again and again, since nothing changes >> // between the reads. >> >> [...] >> >> [5201<->39222]: tcp_recvmsg_locked(->) >> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 >> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] >> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] >> [5201<->39222]: NOT calling tcp_send_ack() >> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] >> [5201<->39222]: __tcp_cleanup_rbuf(<-) >> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] >> [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] >> returning 54672 bytes >> [5201<->39222]: tcp_recvmsg_locked(<-) >> >> // The receive queue is empty, but no new advertisement has been sent. >> // The peer still thinks the receive window is zero, and sends nothing. >> // We have ended up in a deadlock situation. > > This so-called 'deadlock' only occurs if a remote TCP stack is unable > to send win0 probes. > > In this case, sending some ACK will not help reliably if these ACK get lost. > > I find the description tries very hard to hide a bug in another stack, > for some reason. I clearly stated in a previous comment that this was the case, and that it has been fixed now. My reason for posting this is because I still think this is a bug, just as I think the way we use rcv_ssthresh in _tcp_select)window() is a bug that eventually should be fixed. > > When under memory stress, not sending an opening ACK as fast as possible, > giving time for the host to recover from this memory stress was also a > sensible thing to do. > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > Thanks for the fix. >
On Tue, Jan 28, 2025 at 5:51 PM Jon Maloy <jmaloy@redhat.com> wrote: > I clearly stated in a previous comment that this was the case, and that > it has been fixed now. My reason for posting this is because I still > think this is a bug, just as I think the way we use rcv_ssthresh in > _tcp_select)window() is a bug that eventually should be fixed. I was referring to a wrong statement in the changelog, claiming a 'deadlock situation' ... It is pretty clear there is no deadlock here, unless the remote TCP stack is _absolutely_ _broken_. If you still want to capture this in an official changelog, it would be nice to clarify this, to avoid yet another CVE to be filled based on scary sentences misleading many teams in the world. Keep changelogs accurate and factual, so that we can find useful signals in them. All your __tcp_cleanup_rbuf() repetitions are simply noise. It does not matter if it is called once or ten times.
Hello: This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Mon, 27 Jan 2025 18:13:04 -0500 you wrote: > From: Jon Maloy <jmaloy@redhat.com> > > Testing with iperf3 using the "pasta" protocol splicer has revealed > a bug in the way tcp handles window advertising in extreme memory > squeeze situations. > > Under memory pressure, a socket endpoint may temporarily advertise > a zero-sized window, but this is not stored as part of the socket data. > The reasoning behind this is that it is considered a temporary setting > which shouldn't influence any further calculations. > > [...] Here is the summary with links: - [net,v3] tcp: correct handling of extreme memory squeeze https://git.kernel.org/netdev/net/c/8c670bdfa58e You are awesome, thank you!
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0e5b9a654254..bc95d2a5924f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -265,11 +265,14 @@ static u16 tcp_select_window(struct sock *sk) u32 cur_win, new_win; /* Make the window 0 if we failed to queue the data because we - * are out of memory. The window is temporary, so we don't store - * it on the socket. + * are out of memory. */ - if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) + if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) { + tp->pred_flags = 0; + tp->rcv_wnd = 0; + tp->rcv_wup = tp->rcv_nxt; return 0; + } cur_win = tcp_receive_window(tp); new_win = __tcp_select_window(sk);