[net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0

Message ID	20250410175140.10805-3-luizcmpc@gmail.com (mailing list archive)
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	show Received: from mail-pj1-f42.google.com (mail-pj1-f42.google.com [209.85.216.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 01C5C1E9B38 for <netdev@vger.kernel.org>; Thu, 10 Apr 2025 17:54:16 +0000 (UTC) From: Luiz Carvalho <luizcmpc@gmail.com> To: netdev@vger.kernel.org Cc: Luiz Carvalho <luizcmpc@gmail.com>, Eric Dumazet <edumazet@google.com>, Neal Cardwell <ncardwell@google.com> Subject: [PATCH net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0 Date: Thu, 10 Apr 2025 17:50:34 +0000 Message-ID: <20250410175140.10805-3-luizcmpc@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0 \| expand [net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 0 this patch: 0
netdev/build_tools	success	No tools touched, skip
netdev/cc_maintainers	warning	5 maintainers not CCed: kuba@kernel.org kuniyu@amazon.com dsahern@kernel.org pabeni@redhat.com horms@kernel.org
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 0 this patch: 0
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 16 lines checked
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 1 this patch: 1
netdev/source_inline	success	Was 0 now: 0
netdev/contest	success	net-next-2025-04-15--21-00 (tests: 914)

Luiz Carlos Mourão Paes de Carvalho April 10, 2025, 5:50 p.m. UTC

The current tcp_acceptable_seq() returns SND.NXT when the available
window shrinks to less then one scaling factor. This works fine for most
cases, and seemed to not be a problem until a slight behavior change to
how tcp_select_window() handles ZeroWindow cases.

Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
a zero window would only be announced when data failed to be consumed,
and following packets would have non-zero windows despite the receiver
still not having any available space. After the commit, however, the
zero window is stored in the socket and the advertised window will be
zero until the receiver frees up space.

For tcp_acceptable_seq(), a zero window case will result in SND.NXT
being sent, but the problem now arises when the receptor validates the
sequence number in tcp_sequence():

static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
					 u32 seq, u32 end_seq)
{
	// ...
	if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
		return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
	// ...
}

Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
dropped by the receiver, correctly, as RFC793 mentions:

	There are four cases for the acceptability test for an incoming
        segment:

	Segment Receive  Test
        Length  Window
        ------- -------  -------------------------------------------

           0       0     SEG.SEQ = RCV.NXT

The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
and the connection continues. If the receptor announces ZeroWindow
again, the stall could be very long, as was in my case. Found this out
while giving a shot at bug #213827.

Could the precision loss lead to a zero window? If so, then this
approach probably won't work.

Fixes: a4ecb15a2432 ("tcp: accommodate sequence number to a peer's shrunk receive window caused by precision loss in window scaling")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213827
Signed-off-by: Luiz Carvalho <luizcmpc@gmail.com>
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Paolo Abeni April 15, 2025, 11:30 a.m. UTC | #1

On 4/10/25 7:50 PM, Luiz Carvalho wrote:
> The current tcp_acceptable_seq() returns SND.NXT when the available
> window shrinks to less then one scaling factor. This works fine for most
> cases, and seemed to not be a problem until a slight behavior change to
> how tcp_select_window() handles ZeroWindow cases.
> 
> Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> a zero window would only be announced when data failed to be consumed,
> and following packets would have non-zero windows despite the receiver
> still not having any available space. After the commit, however, the
> zero window is stored in the socket and the advertised window will be
> zero until the receiver frees up space.
> 
> For tcp_acceptable_seq(), a zero window case will result in SND.NXT
> being sent, but the problem now arises when the receptor validates the
> sequence number in tcp_sequence():
> 
> static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
> 					 u32 seq, u32 end_seq)
> {
> 	// ...
> 	if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> 		return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> 	// ...
> }
> 
> Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
> the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
> dropped by the receiver, correctly, as RFC793 mentions:
> 
> 	There are four cases for the acceptability test for an incoming
>         segment:
> 
> 	Segment Receive  Test
>         Length  Window
>         ------- -------  -------------------------------------------
> 
>            0       0     SEG.SEQ = RCV.NXT
> 
> The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
> and the connection continues. If the receptor announces ZeroWindow
> again, the stall could be very long, as was in my case. Found this out
> while giving a shot at bug #213827.

The dropped ack causing the stall is a zero window probe from the sender
right?
Could you please describe the relevant scenario with a pktdrill test?

Thanks!

Paolo

Luiz Carlos Mourão Paes de Carvalho April 16, 2025, 8:59 p.m. UTC | #2

Hi Paolo,

The dropped ack is a response to data sent by the peer.

Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT)
that gets dropped
by the peer's tcp_sequence function. Connection only advances when we
send a RTO.

Let me know if the following describes the scenario you expected. I'll
add a packetdrill with
the expected interaction to the patch if it makes sense.

// Tests the invalid SEQs sent by the listener
// which are then dropped by the client.

`./common/defaults.sh
./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 8
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 990) = 990
   +0 > P. 1:991(990) ack 1
   +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer
being almost full, shrink_window=0

   +0 write(4, ..., 100) = 100
   +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
   +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data,
RCV.NXT=991, RCV.WND=0

 +0.1 < P. 1:1001(1000) ack 901 win 0
   +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence,
note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
                                          // if (after(seq,
tp->rcv_nxt + tcp_receive_window(tp)))
                                          //     return
SKB_DROP_REASON_TCP_INVALID_SEQUENCE;

 +0.2 > P. 991:1091(100) ack 1001         // this is a RTO, ack accepted
   +0 < P. 1001:2001(1000) ack 991 win 0  // peer responds, still no
space available, but has more data to send
   +0 > .  1091:1091(0) ack 2001          // ack dropped

 +0.3 > P. 991:1091(100) ack 2001         // RTO, ack accepted
   +0 < .  2001:3001(1000) ack 991 win 0  // still no space available,
but another chunk of data
   +0 > .  1091:1091(0) ack 3001          // ack dropped

 +0.6 > P. 991:1091(100) ack 3001         // RTO, ack accepted
   +0 < .  3001:4001(1000) ack 991 win 0  // no space available, but
peer has data to send at all times
   +0 > .  1091:1091(0) ack 4001          // ack dropped

 +1.2 > P. 991:1091(100) ack 4001         // another probe, accepted

  // this goes on and on. note that the peer always has data just
waiting there to be sent,
  // server acks it, but the ack is dropped because SEQ is incorrect.
  // only the probes are advancing the connection, but are back-offed
every time.

// Reset sysctls
`/tmp/sysctl_restore_${PPID}.sh`

Luiz Carvalho

On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
>
> On 4/10/25 7:50 PM, Luiz Carvalho wrote:
> > The current tcp_acceptable_seq() returns SND.NXT when the available
> > window shrinks to less then one scaling factor. This works fine for most
> > cases, and seemed to not be a problem until a slight behavior change to
> > how tcp_select_window() handles ZeroWindow cases.
> >
> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> > a zero window would only be announced when data failed to be consumed,
> > and following packets would have non-zero windows despite the receiver
> > still not having any available space. After the commit, however, the
> > zero window is stored in the socket and the advertised window will be
> > zero until the receiver frees up space.
> >
> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT
> > being sent, but the problem now arises when the receptor validates the
> > sequence number in tcp_sequence():
> >
> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
> >                                        u32 seq, u32 end_seq)
> > {
> >       // ...
> >       if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> >               return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> >       // ...
> > }
> >
> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
> > dropped by the receiver, correctly, as RFC793 mentions:
> >
> >       There are four cases for the acceptability test for an incoming
> >         segment:
> >
> >       Segment Receive  Test
> >         Length  Window
> >         ------- -------  -------------------------------------------
> >
> >            0       0     SEG.SEQ = RCV.NXT
> >
> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
> > and the connection continues. If the receptor announces ZeroWindow
> > again, the stall could be very long, as was in my case. Found this out
> > while giving a shot at bug #213827.
>
> The dropped ack causing the stall is a zero window probe from the sender
> right?
> Could you please describe the relevant scenario with a pktdrill test?
>
> Thanks!
>
> Paolo
>

Eric Dumazet April 16, 2025, 9:40 p.m. UTC | #3

On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
<luizcmpc@gmail.com> wrote:
>
> Hi Paolo,
>
> The dropped ack is a response to data sent by the peer.
>
> Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> by the peer's tcp_sequence function. Connection only advances when we send a RTO.
>
> Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> the expected interaction to the patch if it makes sense.
>
> // Tests the invalid SEQs sent by the listener
> // which are then dropped by the peer.
>
> `./common/defaults.sh
> ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
>
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
>    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
>    +0 > S. 0:0(0) ack 1 <...>
>   +.1 < . 1:1(0) ack 1 win 8
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 990) = 990
>    +0 > P. 1:991(990) ack 1
>    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
>
>    +0 write(4, ..., 100) = 100
>    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
>    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
>
>  +0.1 < P. 1:1001(1000) ack 901 win 0

This 'ack 901' does not seem right ?

Also your fix would not work if 'win 0' was 'win 1' , and/or if the
initial wscale was 6 instead of 7 ?

>    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
>                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
>                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;

I assume that your patch would change the 1091:1091(0) to 991:991(0) ?

It is not clear if there is a bug here... window reneging is outside
RFC specs unfortunately,
as hinted in the tcp_acceptable_seq() comments.

>
>  +0.2 > P. 991:1091(100) ack 1001         // this is a RTO, ack accepted
>    +0 < P. 1001:2001(1000) ack 991 win 0  // peer responds, still no space available, but has more data to send
>    +0 > .  1091:1091(0) ack 2001          // ack dropped
>
>  +0.3 > P. 991:1091(100) ack 2001         // RTO, ack accepted
>    +0 < .  2001:3001(1000) ack 991 win 0  // still no space available, but another chunk of data
>    +0 > .  1091:1091(0) ack 3001          // ack dropped
>
>  +0.6 > P. 991:1091(100) ack 3001         // RTO, ack accepted
>    +0 < .  3001:4001(1000) ack 991 win 0  // no space available, but peer has data to send at all times
>    +0 > .  1091:1091(0) ack 4001          // ack dropped
>
>  +1.2 > P. 991:1091(100) ack 4001         // another probe, accepted
>
>   // this goes on and on. note that the peer always has data just waiting there to be sent,
>   // server acks it, but the ack is dropped because SEQ is incorrect.
>   // only the RTOs are advancing the connection, but are back-offed every time.
>
> // Reset sysctls
> `/tmp/sysctl_restore_${PPID}.sh`
>
> On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>>
>>
>> On 4/10/25 7:50 PM, Luiz Carvalho wrote:
>> > The current tcp_acceptable_seq() returns SND.NXT when the available
>> > window shrinks to less then one scaling factor. This works fine for most
>> > cases, and seemed to not be a problem until a slight behavior change to
>> > how tcp_select_window() handles ZeroWindow cases.
>> >
>> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
>> > a zero window would only be announced when data failed to be consumed,
>> > and following packets would have non-zero windows despite the receiver
>> > still not having any available space. After the commit, however, the
>> > zero window is stored in the socket and the advertised window will be
>> > zero until the receiver frees up space.
>> >
>> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT
>> > being sent, but the problem now arises when the receptor validates the
>> > sequence number in tcp_sequence():
>> >
>> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
>> >                                        u32 seq, u32 end_seq)
>> > {
>> >       // ...
>> >       if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
>> >               return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>> >       // ...
>> > }
>> >
>> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
>> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
>> > dropped by the receiver, correctly, as RFC793 mentions:
>> >
>> >       There are four cases for the acceptability test for an incoming
>> >         segment:
>> >
>> >       Segment Receive  Test
>> >         Length  Window
>> >         ------- -------  -------------------------------------------
>> >
>> >            0       0     SEG.SEQ = RCV.NXT
>> >
>> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
>> > and the connection continues. If the receptor announces ZeroWindow
>> > again, the stall could be very long, as was in my case. Found this out
>> > while giving a shot at bug #213827.
>>
>> The dropped ack causing the stall is a zero window probe from the sender
>> right?
>> Could you please describe the relevant scenario with a pktdrill test?
>>
>> Thanks!
>>
>> Paolo
>>

Luiz Carlos Mourão Paes de Carvalho April 16, 2025, 10:30 p.m. UTC | #4

On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> <luizcmpc@gmail.com> wrote:
> >
> > Hi Paolo,
> >
> > The dropped ack is a response to data sent by the peer.
> >
> > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> >
> > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > the expected interaction to the patch if it makes sense.
> >
> > // Tests the invalid SEQs sent by the listener
> > // which are then dropped by the peer.
> >
> > `./common/defaults.sh
> > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> >
> >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >    +0 bind(3, ..., ...) = 0
> >    +0 listen(3, 1) = 0
> >
> >    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> >    +0 > S. 0:0(0) ack 1 <...>
> >   +.1 < . 1:1(0) ack 1 win 8
> >    +0 accept(3, ..., ...) = 4
> >
> >    +0 write(4, ..., 990) = 990
> >    +0 > P. 1:991(990) ack 1
> >    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
> >
> >    +0 write(4, ..., 100) = 100
> >    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
> >    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> >
> >  +0.1 < P. 1:1001(1000) ack 901 win 0
>
> This 'ack 901' does not seem right ?

It's indeed incorrect, the bug still occurs if it were 991. Sorry for that.

>
> Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> initial wscale was 6 instead of 7 ?

It indeed does not work if win=1, but that's unlikely to happen unless
you enable shrink_window, and probably
suggests the mentioned loss of precision.

Now, regarding the scale, it does happen with wscale=6 if your second
write sends < 64 bytes.
This is true with any other scale. Would happen if it were wscale=1
and the second write sent 2 bytes, etc.

Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale.

>
> >    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> >                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> >                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>
> I assume that your patch would change the 1091:1091(0) to 991:991(0) ?

Precisely.

>
> It is not clear if there is a bug here... window reneging is outside
> RFC specs unfortunately,
> as hinted in the tcp_acceptable_seq() comments.

Yeah, that got me thinking as well, but although it isn't covered by
the RFC, the behavior did change since
8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
which is a relatively recent patch (Jan 2025).
Currently, the connection could stall indefinitely, which seems
unwanted. I would be happy to search for other
solutions if you have anything come to mind, though.

The way I see it, the stack shouldn't be sending invalid ACKs that are
known to be incorrect.

>
> >
> >  +0.2 > P. 991:1091(100) ack 1001         // this is a RTO, ack accepted
> >    +0 < P. 1001:2001(1000) ack 991 win 0  // peer responds, still no space available, but has more data to send
> >    +0 > .  1091:1091(0) ack 2001          // ack dropped
> >
> >  +0.3 > P. 991:1091(100) ack 2001         // RTO, ack accepted
> >    +0 < .  2001:3001(1000) ack 991 win 0  // still no space available, but another chunk of data
> >    +0 > .  1091:1091(0) ack 3001          // ack dropped
> >
> >  +0.6 > P. 991:1091(100) ack 3001         // RTO, ack accepted
> >    +0 < .  3001:4001(1000) ack 991 win 0  // no space available, but peer has data to send at all times
> >    +0 > .  1091:1091(0) ack 4001          // ack dropped
> >
> >  +1.2 > P. 991:1091(100) ack 4001         // another probe, accepted
> >
> >   // this goes on and on. note that the peer always has data just waiting there to be sent,
> >   // server acks it, but the ack is dropped because SEQ is incorrect.
> >   // only the RTOs are advancing the connection, but are back-offed every time.
> >
> > // Reset sysctls
> > `/tmp/sysctl_restore_${PPID}.sh`
> >
> > On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >>
> >>
> >> On 4/10/25 7:50 PM, Luiz Carvalho wrote:
> >> > The current tcp_acceptable_seq() returns SND.NXT when the available
> >> > window shrinks to less then one scaling factor. This works fine for most
> >> > cases, and seemed to not be a problem until a slight behavior change to
> >> > how tcp_select_window() handles ZeroWindow cases.
> >> >
> >> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> >> > a zero window would only be announced when data failed to be consumed,
> >> > and following packets would have non-zero windows despite the receiver
> >> > still not having any available space. After the commit, however, the
> >> > zero window is stored in the socket and the advertised window will be
> >> > zero until the receiver frees up space.
> >> >
> >> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT
> >> > being sent, but the problem now arises when the receptor validates the
> >> > sequence number in tcp_sequence():
> >> >
> >> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
> >> >                                        u32 seq, u32 end_seq)
> >> > {
> >> >       // ...
> >> >       if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> >> >               return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> >> >       // ...
> >> > }
> >> >
> >> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail
> >> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is
> >> > dropped by the receiver, correctly, as RFC793 mentions:
> >> >
> >> >       There are four cases for the acceptability test for an incoming
> >> >         segment:
> >> >
> >> >       Segment Receive  Test
> >> >         Length  Window
> >> >         ------- -------  -------------------------------------------
> >> >
> >> >            0       0     SEG.SEQ = RCV.NXT
> >> >
> >> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again,
> >> > and the connection continues. If the receptor announces ZeroWindow
> >> > again, the stall could be very long, as was in my case. Found this out
> >> > while giving a shot at bug #213827.
> >>
> >> The dropped ack causing the stall is a zero window probe from the sender
> >> right?
> >> Could you please describe the relevant scenario with a pktdrill test?
> >>
> >> Thanks!
> >>
> >> Paolo
> >>

Eric Dumazet April 16, 2025, 10:32 p.m. UTC | #5

On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho
<luizcmpc@gmail.com> wrote:
>
> On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> > <luizcmpc@gmail.com> wrote:
> > >
> > > Hi Paolo,
> > >
> > > The dropped ack is a response to data sent by the peer.
> > >
> > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> > >
> > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > > the expected interaction to the patch if it makes sense.
> > >
> > > // Tests the invalid SEQs sent by the listener
> > > // which are then dropped by the peer.
> > >
> > > `./common/defaults.sh
> > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> > >
> > >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > >    +0 bind(3, ..., ...) = 0
> > >    +0 listen(3, 1) = 0
> > >
> > >    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> > >    +0 > S. 0:0(0) ack 1 <...>
> > >   +.1 < . 1:1(0) ack 1 win 8
> > >    +0 accept(3, ..., ...) = 4
> > >
> > >    +0 write(4, ..., 990) = 990
> > >    +0 > P. 1:991(990) ack 1
> > >    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
> > >
> > >    +0 write(4, ..., 100) = 100
> > >    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
> > >    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> > >
> > >  +0.1 < P. 1:1001(1000) ack 901 win 0
> >
> > This 'ack 901' does not seem right ?
>
> It's indeed incorrect, the bug still occurs if it were 991. Sorry for that.
>
> >
> > Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> > initial wscale was 6 instead of 7 ?
>
> It indeed does not work if win=1, but that's unlikely to happen unless
> you enable shrink_window, and probably
> suggests the mentioned loss of precision.
>
> Now, regarding the scale, it does happen with wscale=6 if your second
> write sends < 64 bytes.
> This is true with any other scale. Would happen if it were wscale=1
> and the second write sent 2 bytes, etc.
>
> Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale.
>
> >
> > >    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> > >                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> > >                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> >
> > I assume that your patch would change the 1091:1091(0) to 991:991(0) ?
>
> Precisely.
>
> >
> > It is not clear if there is a bug here... window reneging is outside
> > RFC specs unfortunately,
> > as hinted in the tcp_acceptable_seq() comments.
>
> Yeah, that got me thinking as well, but although it isn't covered by
> the RFC, the behavior did change since
> 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> which is a relatively recent patch (Jan 2025).
> Currently, the connection could stall indefinitely, which seems
> unwanted. I would be happy to search for other
> solutions if you have anything come to mind, though.
>
> The way I see it, the stack shouldn't be sending invalid ACKs that are
> known to be incorrect.

These are not ACK, but sequence numbers. They were correct when initially sent.

Eric Dumazet April 16, 2025, 10:37 p.m. UTC | #6

On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho
> <luizcmpc@gmail.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> > > <luizcmpc@gmail.com> wrote:
> > > >
> > > > Hi Paolo,
> > > >
> > > > The dropped ack is a response to data sent by the peer.
> > > >
> > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> > > >
> > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > > > the expected interaction to the patch if it makes sense.
> > > >
> > > > // Tests the invalid SEQs sent by the listener
> > > > // which are then dropped by the peer.
> > > >
> > > > `./common/defaults.sh
> > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> > > >
> > > >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > > >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > >    +0 bind(3, ..., ...) = 0
> > > >    +0 listen(3, 1) = 0
> > > >
> > > >    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> > > >    +0 > S. 0:0(0) ack 1 <...>
> > > >   +.1 < . 1:1(0) ack 1 win 8
> > > >    +0 accept(3, ..., ...) = 4
> > > >
> > > >    +0 write(4, ..., 990) = 990
> > > >    +0 > P. 1:991(990) ack 1
> > > >    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
> > > >
> > > >    +0 write(4, ..., 100) = 100
> > > >    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
> > > >    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> > > >
> > > >  +0.1 < P. 1:1001(1000) ack 901 win 0
> > >
> > > This 'ack 901' does not seem right ?
> >
> > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that.
> >
> > >
> > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> > > initial wscale was 6 instead of 7 ?
> >
> > It indeed does not work if win=1, but that's unlikely to happen unless
> > you enable shrink_window, and probably
> > suggests the mentioned loss of precision.
> >
> > Now, regarding the scale, it does happen with wscale=6 if your second
> > write sends < 64 bytes.
> > This is true with any other scale. Would happen if it were wscale=1
> > and the second write sent 2 bytes, etc.
> >
> > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale.
> >
> > >
> > > >    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> > > >                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> > > >                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> > >
> > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ?
> >
> > Precisely.
> >
> > >
> > > It is not clear if there is a bug here... window reneging is outside
> > > RFC specs unfortunately,
> > > as hinted in the tcp_acceptable_seq() comments.
> >
> > Yeah, that got me thinking as well, but although it isn't covered by
> > the RFC, the behavior did change since
> > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> > which is a relatively recent patch (Jan 2025).
> > Currently, the connection could stall indefinitely, which seems
> > unwanted. I would be happy to search for other
> > solutions if you have anything come to mind, though.
> >
> > The way I see it, the stack shouldn't be sending invalid ACKs that are
> > known to be incorrect.
>
> These are not ACK, but sequence numbers. They were correct when initially sent.

You might try to fix the issue on the other side of the connection,
the one doing reneging...

Luiz Carlos Mourão Paes de Carvalho April 16, 2025, 10:49 p.m. UTC | #7

On Wed, Apr 16, 2025 at 7:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho
> > <luizcmpc@gmail.com> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> > > > <luizcmpc@gmail.com> wrote:
> > > > >
> > > > > Hi Paolo,
> > > > >
> > > > > The dropped ack is a response to data sent by the peer.
> > > > >
> > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> > > > >
> > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > > > > the expected interaction to the patch if it makes sense.
> > > > >
> > > > > // Tests the invalid SEQs sent by the listener
> > > > > // which are then dropped by the peer.
> > > > >
> > > > > `./common/defaults.sh
> > > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> > > > >
> > > > >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > > > >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > >    +0 bind(3, ..., ...) = 0
> > > > >    +0 listen(3, 1) = 0
> > > > >
> > > > >    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> > > > >    +0 > S. 0:0(0) ack 1 <...>
> > > > >   +.1 < . 1:1(0) ack 1 win 8
> > > > >    +0 accept(3, ..., ...) = 4
> > > > >
> > > > >    +0 write(4, ..., 990) = 990
> > > > >    +0 > P. 1:991(990) ack 1
> > > > >    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
> > > > >
> > > > >    +0 write(4, ..., 100) = 100
> > > > >    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
> > > > >    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> > > > >
> > > > >  +0.1 < P. 1:1001(1000) ack 901 win 0
> > > >
> > > > This 'ack 901' does not seem right ?
> > >
> > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that.
> > >
> > > >
> > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> > > > initial wscale was 6 instead of 7 ?
> > >
> > > It indeed does not work if win=1, but that's unlikely to happen unless
> > > you enable shrink_window, and probably
> > > suggests the mentioned loss of precision.
> > >
> > > Now, regarding the scale, it does happen with wscale=6 if your second
> > > write sends < 64 bytes.
> > > This is true with any other scale. Would happen if it were wscale=1
> > > and the second write sent 2 bytes, etc.
> > >
> > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale.
> > >
> > > >
> > > > >    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> > > > >                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> > > > >                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> > > >
> > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ?
> > >
> > > Precisely.
> > >
> > > >
> > > > It is not clear if there is a bug here... window reneging is outside
> > > > RFC specs unfortunately,
> > > > as hinted in the tcp_acceptable_seq() comments.
> > >
> > > Yeah, that got me thinking as well, but although it isn't covered by
> > > the RFC, the behavior did change since
> > > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> > > which is a relatively recent patch (Jan 2025).
> > > Currently, the connection could stall indefinitely, which seems
> > > unwanted. I would be happy to search for other
> > > solutions if you have anything come to mind, though.
> > >
> > > The way I see it, the stack shouldn't be sending invalid ACKs that are
> > > known to be incorrect.
> >
> > These are not ACK, but sequence numbers. They were correct when initially sent.

Yes, I meant invalid ACK packets, with "incorrect" sequence numbers
(more advanced than they should have been for this specific ZeroWindow
scenario). The server has enough knowledge to know what the other peer
expects (the RFC 793 quote in the original message), thus the "known
to be incorrect". I am, however, new to the spec and stack.

>
> You might try to fix the issue on the other side of the connection,
> the one doing reneging...

Would that be adjusting tcp_sequence as per RFC 793, page 69?

        If the RCV.WND is zero, no segments will be acceptable, but
        special allowance should be made to accept valid ACKs, URGs and
        RSTs.

My initial idea was to change tcp_sequence slightly to pass the test
if RCV.WND is 0. My assessment was that it could go against the
mentioned test (SEG.SEQ = RCV.NXT), and would also require some bigger
changes to tcp_validate_incoming as tcp_rcv_established would still
need to drop the SKB but process the ACK. I'd be happy to give it a
shot anyway.

Eric Dumazet April 16, 2025, 11 p.m. UTC | #8

On Wed, Apr 16, 2025 at 3:49 PM Luiz Carlos Mourão Paes de Carvalho
<luizcmpc@gmail.com> wrote:
>
> On Wed, Apr 16, 2025 at 7:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho
> > > <luizcmpc@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho
> > > > > <luizcmpc@gmail.com> wrote:
> > > > > >
> > > > > > Hi Paolo,
> > > > > >
> > > > > > The dropped ack is a response to data sent by the peer.
> > > > > >
> > > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped
> > > > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO.
> > > > > >
> > > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with
> > > > > > the expected interaction to the patch if it makes sense.
> > > > > >
> > > > > > // Tests the invalid SEQs sent by the listener
> > > > > > // which are then dropped by the peer.
> > > > > >
> > > > > > `./common/defaults.sh
> > > > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0`
> > > > > >
> > > > > >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > > > > >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > > >    +0 bind(3, ..., ...) = 0
> > > > > >    +0 listen(3, 1) = 0
> > > > > >
> > > > > >    +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> > > > > >    +0 > S. 0:0(0) ack 1 <...>
> > > > > >   +.1 < . 1:1(0) ack 1 win 8
> > > > > >    +0 accept(3, ..., ...) = 4
> > > > > >
> > > > > >    +0 write(4, ..., 990) = 990
> > > > > >    +0 > P. 1:991(990) ack 1
> > > > > >    +0 < .  1:1(0) ack 991 win 8           // win=8 despite buffer being almost full, shrink_window=0
> > > > > >
> > > > > >    +0 write(4, ..., 100) = 100
> > > > > >    +0 > P. 991:1091(100) ack 1            // SND.NXT=1091
> > > > > >    +0 < .  1:1(0) ack 991 win 0           // failed to queue rx data, RCV.NXT=991, RCV.WND=0
> > > > > >
> > > > > >  +0.1 < P. 1:1001(1000) ack 901 win 0
> > > > >
> > > > > This 'ack 901' does not seem right ?
> > > >
> > > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that.
> > > >
> > > > >
> > > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the
> > > > > initial wscale was 6 instead of 7 ?
> > > >
> > > > It indeed does not work if win=1, but that's unlikely to happen unless
> > > > you enable shrink_window, and probably
> > > > suggests the mentioned loss of precision.
> > > >
> > > > Now, regarding the scale, it does happen with wscale=6 if your second
> > > > write sends < 64 bytes.
> > > > This is true with any other scale. Would happen if it were wscale=1
> > > > and the second write sent 2 bytes, etc.
> > > >
> > > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale.
> > > >
> > > > >
> > > > > >    +0 > .  1091:1091(0) ack 1001          // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991:
> > > > > >                                           // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> > > > > >                                           //     return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
> > > > >
> > > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ?
> > > >
> > > > Precisely.
> > > >
> > > > >
> > > > > It is not clear if there is a bug here... window reneging is outside
> > > > > RFC specs unfortunately,
> > > > > as hinted in the tcp_acceptable_seq() comments.
> > > >
> > > > Yeah, that got me thinking as well, but although it isn't covered by
> > > > the RFC, the behavior did change since
> > > > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"),
> > > > which is a relatively recent patch (Jan 2025).
> > > > Currently, the connection could stall indefinitely, which seems
> > > > unwanted. I would be happy to search for other
> > > > solutions if you have anything come to mind, though.
> > > >
> > > > The way I see it, the stack shouldn't be sending invalid ACKs that are
> > > > known to be incorrect.
> > >
> > > These are not ACK, but sequence numbers. They were correct when initially sent.
>
> Yes, I meant invalid ACK packets, with "incorrect" sequence numbers
> (more advanced than they should have been for this specific ZeroWindow
> scenario). The server has enough knowledge to know what the other peer
> expects (the RFC 793 quote in the original message), thus the "known
> to be incorrect". I am, however, new to the spec and stack.
>
> >
> > You might try to fix the issue on the other side of the connection,
> > the one doing reneging...
>
> Would that be adjusting tcp_sequence as per RFC 793, page 69?
>
>         If the RCV.WND is zero, no segments will be acceptable, but
>         special allowance should be made to accept valid ACKs, URGs and
>         RSTs.
>
> My initial idea was to change tcp_sequence slightly to pass the test
> if RCV.WND is 0. My assessment was that it could go against the
> mentioned test (SEG.SEQ = RCV.NXT), and would also require some bigger
> changes to tcp_validate_incoming as tcp_rcv_established would still
> need to drop the SKB but process the ACK. I'd be happy to give it a
> shot anyway.

Please do not focus on RWIN 0, but more generally.

If a peer sent a ACK @seq WIN (@X << wscale), at some point in the past,
then it should accept any SEQ _before_  @seq + (@X << wscale),
especially from a pure ACK packet.

Reneging (ie decrease RWIN) makes sense as a way to deal with memory stress,
but should allow pure acks to be accepted if there sequence is not too
far in the future.

tcp_receive_window(const struct tcp_sock *tp)
...
s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;


...
if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
    return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;


-->

if (after(seq, max_admissible_seq(tp))
    return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;

Reneging would no retract max_admissible_seq() as far as
tcp_sequence() is concerned.

[net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0

Checks

Commit Message

Comments

Patch