Message ID | 20250410175140.10805-3-luizcmpc@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] tcp: tcp_acceptable_seq select SND.UNA when SND.WND is 0 | expand |
On 4/10/25 7:50 PM, Luiz Carvalho wrote: > The current tcp_acceptable_seq() returns SND.NXT when the available > window shrinks to less then one scaling factor. This works fine for most > cases, and seemed to not be a problem until a slight behavior change to > how tcp_select_window() handles ZeroWindow cases. > > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > a zero window would only be announced when data failed to be consumed, > and following packets would have non-zero windows despite the receiver > still not having any available space. After the commit, however, the > zero window is stored in the socket and the advertised window will be > zero until the receiver frees up space. > > For tcp_acceptable_seq(), a zero window case will result in SND.NXT > being sent, but the problem now arises when the receptor validates the > sequence number in tcp_sequence(): > > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp, > u32 seq, u32 end_seq) > { > // ... > if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > // ... > } > > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is > dropped by the receiver, correctly, as RFC793 mentions: > > There are four cases for the acceptability test for an incoming > segment: > > Segment Receive Test > Length Window > ------- ------- ------------------------------------------- > > 0 0 SEG.SEQ = RCV.NXT > > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again, > and the connection continues. If the receptor announces ZeroWindow > again, the stall could be very long, as was in my case. Found this out > while giving a shot at bug #213827. The dropped ack causing the stall is a zero window probe from the sender right? Could you please describe the relevant scenario with a pktdrill test? Thanks! Paolo
Hi Paolo, The dropped ack is a response to data sent by the peer. Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped by the peer's tcp_sequence function. Connection only advances when we send a RTO. Let me know if the following describes the scenario you expected. I'll add a packetdrill with the expected interaction to the patch if it makes sense. // Tests the invalid SEQs sent by the listener // which are then dropped by the client. `./common/defaults.sh ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.1 < . 1:1(0) ack 1 win 8 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 990) = 990 +0 > P. 1:991(990) ack 1 +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 +0 write(4, ..., 100) = 100 +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 +0.1 < P. 1:1001(1000) ack 901 win 0 +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; +0.2 > P. 991:1091(100) ack 1001 // this is a RTO, ack accepted +0 < P. 1001:2001(1000) ack 991 win 0 // peer responds, still no space available, but has more data to send +0 > . 1091:1091(0) ack 2001 // ack dropped +0.3 > P. 991:1091(100) ack 2001 // RTO, ack accepted +0 < . 2001:3001(1000) ack 991 win 0 // still no space available, but another chunk of data +0 > . 1091:1091(0) ack 3001 // ack dropped +0.6 > P. 991:1091(100) ack 3001 // RTO, ack accepted +0 < . 3001:4001(1000) ack 991 win 0 // no space available, but peer has data to send at all times +0 > . 1091:1091(0) ack 4001 // ack dropped +1.2 > P. 991:1091(100) ack 4001 // another probe, accepted // this goes on and on. note that the peer always has data just waiting there to be sent, // server acks it, but the ack is dropped because SEQ is incorrect. // only the probes are advancing the connection, but are back-offed every time. // Reset sysctls `/tmp/sysctl_restore_${PPID}.sh` Luiz Carvalho On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote: > > > > On 4/10/25 7:50 PM, Luiz Carvalho wrote: > > The current tcp_acceptable_seq() returns SND.NXT when the available > > window shrinks to less then one scaling factor. This works fine for most > > cases, and seemed to not be a problem until a slight behavior change to > > how tcp_select_window() handles ZeroWindow cases. > > > > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > > a zero window would only be announced when data failed to be consumed, > > and following packets would have non-zero windows despite the receiver > > still not having any available space. After the commit, however, the > > zero window is stored in the socket and the advertised window will be > > zero until the receiver frees up space. > > > > For tcp_acceptable_seq(), a zero window case will result in SND.NXT > > being sent, but the problem now arises when the receptor validates the > > sequence number in tcp_sequence(): > > > > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp, > > u32 seq, u32 end_seq) > > { > > // ... > > if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > // ... > > } > > > > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail > > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is > > dropped by the receiver, correctly, as RFC793 mentions: > > > > There are four cases for the acceptability test for an incoming > > segment: > > > > Segment Receive Test > > Length Window > > ------- ------- ------------------------------------------- > > > > 0 0 SEG.SEQ = RCV.NXT > > > > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again, > > and the connection continues. If the receptor announces ZeroWindow > > again, the stall could be very long, as was in my case. Found this out > > while giving a shot at bug #213827. > > The dropped ack causing the stall is a zero window probe from the sender > right? > Could you please describe the relevant scenario with a pktdrill test? > > Thanks! > > Paolo >
On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho <luizcmpc@gmail.com> wrote: > > Hi Paolo, > > The dropped ack is a response to data sent by the peer. > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > the expected interaction to the patch if it makes sense. > > // Tests the invalid SEQs sent by the listener > // which are then dropped by the peer. > > `./common/defaults.sh > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > +0 bind(3, ..., ...) = 0 > +0 listen(3, 1) = 0 > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > +0 > S. 0:0(0) ack 1 <...> > +.1 < . 1:1(0) ack 1 win 8 > +0 accept(3, ..., ...) = 4 > > +0 write(4, ..., 990) = 990 > +0 > P. 1:991(990) ack 1 > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > +0 write(4, ..., 100) = 100 > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > +0.1 < P. 1:1001(1000) ack 901 win 0 This 'ack 901' does not seem right ? Also your fix would not work if 'win 0' was 'win 1' , and/or if the initial wscale was 6 instead of 7 ? > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; I assume that your patch would change the 1091:1091(0) to 991:991(0) ? It is not clear if there is a bug here... window reneging is outside RFC specs unfortunately, as hinted in the tcp_acceptable_seq() comments. > > +0.2 > P. 991:1091(100) ack 1001 // this is a RTO, ack accepted > +0 < P. 1001:2001(1000) ack 991 win 0 // peer responds, still no space available, but has more data to send > +0 > . 1091:1091(0) ack 2001 // ack dropped > > +0.3 > P. 991:1091(100) ack 2001 // RTO, ack accepted > +0 < . 2001:3001(1000) ack 991 win 0 // still no space available, but another chunk of data > +0 > . 1091:1091(0) ack 3001 // ack dropped > > +0.6 > P. 991:1091(100) ack 3001 // RTO, ack accepted > +0 < . 3001:4001(1000) ack 991 win 0 // no space available, but peer has data to send at all times > +0 > . 1091:1091(0) ack 4001 // ack dropped > > +1.2 > P. 991:1091(100) ack 4001 // another probe, accepted > > // this goes on and on. note that the peer always has data just waiting there to be sent, > // server acks it, but the ack is dropped because SEQ is incorrect. > // only the RTOs are advancing the connection, but are back-offed every time. > > // Reset sysctls > `/tmp/sysctl_restore_${PPID}.sh` > > On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote: >> >> >> >> On 4/10/25 7:50 PM, Luiz Carvalho wrote: >> > The current tcp_acceptable_seq() returns SND.NXT when the available >> > window shrinks to less then one scaling factor. This works fine for most >> > cases, and seemed to not be a problem until a slight behavior change to >> > how tcp_select_window() handles ZeroWindow cases. >> > >> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), >> > a zero window would only be announced when data failed to be consumed, >> > and following packets would have non-zero windows despite the receiver >> > still not having any available space. After the commit, however, the >> > zero window is stored in the socket and the advertised window will be >> > zero until the receiver frees up space. >> > >> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT >> > being sent, but the problem now arises when the receptor validates the >> > sequence number in tcp_sequence(): >> > >> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp, >> > u32 seq, u32 end_seq) >> > { >> > // ... >> > if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) >> > return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; >> > // ... >> > } >> > >> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail >> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is >> > dropped by the receiver, correctly, as RFC793 mentions: >> > >> > There are four cases for the acceptability test for an incoming >> > segment: >> > >> > Segment Receive Test >> > Length Window >> > ------- ------- ------------------------------------------- >> > >> > 0 0 SEG.SEQ = RCV.NXT >> > >> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again, >> > and the connection continues. If the receptor announces ZeroWindow >> > again, the stall could be very long, as was in my case. Found this out >> > while giving a shot at bug #213827. >> >> The dropped ack causing the stall is a zero window probe from the sender >> right? >> Could you please describe the relevant scenario with a pktdrill test? >> >> Thanks! >> >> Paolo >>
On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho > <luizcmpc@gmail.com> wrote: > > > > Hi Paolo, > > > > The dropped ack is a response to data sent by the peer. > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > > the expected interaction to the patch if it makes sense. > > > > // Tests the invalid SEQs sent by the listener > > // which are then dropped by the peer. > > > > `./common/defaults.sh > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > +0 bind(3, ..., ...) = 0 > > +0 listen(3, 1) = 0 > > > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > > +0 > S. 0:0(0) ack 1 <...> > > +.1 < . 1:1(0) ack 1 win 8 > > +0 accept(3, ..., ...) = 4 > > > > +0 write(4, ..., 990) = 990 > > +0 > P. 1:991(990) ack 1 > > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > > > +0 write(4, ..., 100) = 100 > > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > > > +0.1 < P. 1:1001(1000) ack 901 win 0 > > This 'ack 901' does not seem right ? It's indeed incorrect, the bug still occurs if it were 991. Sorry for that. > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the > initial wscale was 6 instead of 7 ? It indeed does not work if win=1, but that's unlikely to happen unless you enable shrink_window, and probably suggests the mentioned loss of precision. Now, regarding the scale, it does happen with wscale=6 if your second write sends < 64 bytes. This is true with any other scale. Would happen if it were wscale=1 and the second write sent 2 bytes, etc. Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale. > > > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ? Precisely. > > It is not clear if there is a bug here... window reneging is outside > RFC specs unfortunately, > as hinted in the tcp_acceptable_seq() comments. Yeah, that got me thinking as well, but although it isn't covered by the RFC, the behavior did change since 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), which is a relatively recent patch (Jan 2025). Currently, the connection could stall indefinitely, which seems unwanted. I would be happy to search for other solutions if you have anything come to mind, though. The way I see it, the stack shouldn't be sending invalid ACKs that are known to be incorrect. > > > > > +0.2 > P. 991:1091(100) ack 1001 // this is a RTO, ack accepted > > +0 < P. 1001:2001(1000) ack 991 win 0 // peer responds, still no space available, but has more data to send > > +0 > . 1091:1091(0) ack 2001 // ack dropped > > > > +0.3 > P. 991:1091(100) ack 2001 // RTO, ack accepted > > +0 < . 2001:3001(1000) ack 991 win 0 // still no space available, but another chunk of data > > +0 > . 1091:1091(0) ack 3001 // ack dropped > > > > +0.6 > P. 991:1091(100) ack 3001 // RTO, ack accepted > > +0 < . 3001:4001(1000) ack 991 win 0 // no space available, but peer has data to send at all times > > +0 > . 1091:1091(0) ack 4001 // ack dropped > > > > +1.2 > P. 991:1091(100) ack 4001 // another probe, accepted > > > > // this goes on and on. note that the peer always has data just waiting there to be sent, > > // server acks it, but the ack is dropped because SEQ is incorrect. > > // only the RTOs are advancing the connection, but are back-offed every time. > > > > // Reset sysctls > > `/tmp/sysctl_restore_${PPID}.sh` > > > > On Tue, Apr 15, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote: > >> > >> > >> > >> On 4/10/25 7:50 PM, Luiz Carvalho wrote: > >> > The current tcp_acceptable_seq() returns SND.NXT when the available > >> > window shrinks to less then one scaling factor. This works fine for most > >> > cases, and seemed to not be a problem until a slight behavior change to > >> > how tcp_select_window() handles ZeroWindow cases. > >> > > >> > Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > >> > a zero window would only be announced when data failed to be consumed, > >> > and following packets would have non-zero windows despite the receiver > >> > still not having any available space. After the commit, however, the > >> > zero window is stored in the socket and the advertised window will be > >> > zero until the receiver frees up space. > >> > > >> > For tcp_acceptable_seq(), a zero window case will result in SND.NXT > >> > being sent, but the problem now arises when the receptor validates the > >> > sequence number in tcp_sequence(): > >> > > >> > static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp, > >> > u32 seq, u32 end_seq) > >> > { > >> > // ... > >> > if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > >> > return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > >> > // ... > >> > } > >> > > >> > Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail > >> > the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is > >> > dropped by the receiver, correctly, as RFC793 mentions: > >> > > >> > There are four cases for the acceptability test for an incoming > >> > segment: > >> > > >> > Segment Receive Test > >> > Length Window > >> > ------- ------- ------------------------------------------- > >> > > >> > 0 0 SEG.SEQ = RCV.NXT > >> > > >> > The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again, > >> > and the connection continues. If the receptor announces ZeroWindow > >> > again, the stall could be very long, as was in my case. Found this out > >> > while giving a shot at bug #213827. > >> > >> The dropped ack causing the stall is a zero window probe from the sender > >> right? > >> Could you please describe the relevant scenario with a pktdrill test? > >> > >> Thanks! > >> > >> Paolo > >>
On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho <luizcmpc@gmail.com> wrote: > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho > > <luizcmpc@gmail.com> wrote: > > > > > > Hi Paolo, > > > > > > The dropped ack is a response to data sent by the peer. > > > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > > > the expected interaction to the patch if it makes sense. > > > > > > // Tests the invalid SEQs sent by the listener > > > // which are then dropped by the peer. > > > > > > `./common/defaults.sh > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > > > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > > +0 bind(3, ..., ...) = 0 > > > +0 listen(3, 1) = 0 > > > > > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > > > +0 > S. 0:0(0) ack 1 <...> > > > +.1 < . 1:1(0) ack 1 win 8 > > > +0 accept(3, ..., ...) = 4 > > > > > > +0 write(4, ..., 990) = 990 > > > +0 > P. 1:991(990) ack 1 > > > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > > > > > +0 write(4, ..., 100) = 100 > > > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > > > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > > > > > +0.1 < P. 1:1001(1000) ack 901 win 0 > > > > This 'ack 901' does not seem right ? > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that. > > > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the > > initial wscale was 6 instead of 7 ? > > It indeed does not work if win=1, but that's unlikely to happen unless > you enable shrink_window, and probably > suggests the mentioned loss of precision. > > Now, regarding the scale, it does happen with wscale=6 if your second > write sends < 64 bytes. > This is true with any other scale. Would happen if it were wscale=1 > and the second write sent 2 bytes, etc. > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale. > > > > > > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > > > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ? > > Precisely. > > > > > It is not clear if there is a bug here... window reneging is outside > > RFC specs unfortunately, > > as hinted in the tcp_acceptable_seq() comments. > > Yeah, that got me thinking as well, but although it isn't covered by > the RFC, the behavior did change since > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > which is a relatively recent patch (Jan 2025). > Currently, the connection could stall indefinitely, which seems > unwanted. I would be happy to search for other > solutions if you have anything come to mind, though. > > The way I see it, the stack shouldn't be sending invalid ACKs that are > known to be incorrect. These are not ACK, but sequence numbers. They were correct when initially sent.
On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho > <luizcmpc@gmail.com> wrote: > > > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho > > > <luizcmpc@gmail.com> wrote: > > > > > > > > Hi Paolo, > > > > > > > > The dropped ack is a response to data sent by the peer. > > > > > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > > > > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > > > > the expected interaction to the patch if it makes sense. > > > > > > > > // Tests the invalid SEQs sent by the listener > > > > // which are then dropped by the peer. > > > > > > > > `./common/defaults.sh > > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > > > > > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > > > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > > > +0 bind(3, ..., ...) = 0 > > > > +0 listen(3, 1) = 0 > > > > > > > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > > > > +0 > S. 0:0(0) ack 1 <...> > > > > +.1 < . 1:1(0) ack 1 win 8 > > > > +0 accept(3, ..., ...) = 4 > > > > > > > > +0 write(4, ..., 990) = 990 > > > > +0 > P. 1:991(990) ack 1 > > > > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > > > > > > > +0 write(4, ..., 100) = 100 > > > > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > > > > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > > > > > > > +0.1 < P. 1:1001(1000) ack 901 win 0 > > > > > > This 'ack 901' does not seem right ? > > > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that. > > > > > > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the > > > initial wscale was 6 instead of 7 ? > > > > It indeed does not work if win=1, but that's unlikely to happen unless > > you enable shrink_window, and probably > > suggests the mentioned loss of precision. > > > > Now, regarding the scale, it does happen with wscale=6 if your second > > write sends < 64 bytes. > > This is true with any other scale. Would happen if it were wscale=1 > > and the second write sent 2 bytes, etc. > > > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale. > > > > > > > > > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > > > > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > > > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > > > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ? > > > > Precisely. > > > > > > > > It is not clear if there is a bug here... window reneging is outside > > > RFC specs unfortunately, > > > as hinted in the tcp_acceptable_seq() comments. > > > > Yeah, that got me thinking as well, but although it isn't covered by > > the RFC, the behavior did change since > > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > > which is a relatively recent patch (Jan 2025). > > Currently, the connection could stall indefinitely, which seems > > unwanted. I would be happy to search for other > > solutions if you have anything come to mind, though. > > > > The way I see it, the stack shouldn't be sending invalid ACKs that are > > known to be incorrect. > > These are not ACK, but sequence numbers. They were correct when initially sent. You might try to fix the issue on the other side of the connection, the one doing reneging...
On Wed, Apr 16, 2025 at 7:37 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho > > <luizcmpc@gmail.com> wrote: > > > > > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho > > > > <luizcmpc@gmail.com> wrote: > > > > > > > > > > Hi Paolo, > > > > > > > > > > The dropped ack is a response to data sent by the peer. > > > > > > > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > > > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > > > > > > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > > > > > the expected interaction to the patch if it makes sense. > > > > > > > > > > // Tests the invalid SEQs sent by the listener > > > > > // which are then dropped by the peer. > > > > > > > > > > `./common/defaults.sh > > > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > > > > > > > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > > > > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > > > > +0 bind(3, ..., ...) = 0 > > > > > +0 listen(3, 1) = 0 > > > > > > > > > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > > > > > +0 > S. 0:0(0) ack 1 <...> > > > > > +.1 < . 1:1(0) ack 1 win 8 > > > > > +0 accept(3, ..., ...) = 4 > > > > > > > > > > +0 write(4, ..., 990) = 990 > > > > > +0 > P. 1:991(990) ack 1 > > > > > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > > > > > > > > > +0 write(4, ..., 100) = 100 > > > > > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > > > > > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > > > > > > > > > +0.1 < P. 1:1001(1000) ack 901 win 0 > > > > > > > > This 'ack 901' does not seem right ? > > > > > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that. > > > > > > > > > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the > > > > initial wscale was 6 instead of 7 ? > > > > > > It indeed does not work if win=1, but that's unlikely to happen unless > > > you enable shrink_window, and probably > > > suggests the mentioned loss of precision. > > > > > > Now, regarding the scale, it does happen with wscale=6 if your second > > > write sends < 64 bytes. > > > This is true with any other scale. Would happen if it were wscale=1 > > > and the second write sent 2 bytes, etc. > > > > > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale. > > > > > > > > > > > > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > > > > > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > > > > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > > > > > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ? > > > > > > Precisely. > > > > > > > > > > > It is not clear if there is a bug here... window reneging is outside > > > > RFC specs unfortunately, > > > > as hinted in the tcp_acceptable_seq() comments. > > > > > > Yeah, that got me thinking as well, but although it isn't covered by > > > the RFC, the behavior did change since > > > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > > > which is a relatively recent patch (Jan 2025). > > > Currently, the connection could stall indefinitely, which seems > > > unwanted. I would be happy to search for other > > > solutions if you have anything come to mind, though. > > > > > > The way I see it, the stack shouldn't be sending invalid ACKs that are > > > known to be incorrect. > > > > These are not ACK, but sequence numbers. They were correct when initially sent. Yes, I meant invalid ACK packets, with "incorrect" sequence numbers (more advanced than they should have been for this specific ZeroWindow scenario). The server has enough knowledge to know what the other peer expects (the RFC 793 quote in the original message), thus the "known to be incorrect". I am, however, new to the spec and stack. > > You might try to fix the issue on the other side of the connection, > the one doing reneging... Would that be adjusting tcp_sequence as per RFC 793, page 69? If the RCV.WND is zero, no segments will be acceptable, but special allowance should be made to accept valid ACKs, URGs and RSTs. My initial idea was to change tcp_sequence slightly to pass the test if RCV.WND is 0. My assessment was that it could go against the mentioned test (SEG.SEQ = RCV.NXT), and would also require some bigger changes to tcp_validate_incoming as tcp_rcv_established would still need to drop the SKB but process the ACK. I'd be happy to give it a shot anyway.
On Wed, Apr 16, 2025 at 3:49 PM Luiz Carlos Mourão Paes de Carvalho <luizcmpc@gmail.com> wrote: > > On Wed, Apr 16, 2025 at 7:37 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Apr 16, 2025 at 3:32 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, Apr 16, 2025 at 3:30 PM Luiz Carlos Mourão Paes de Carvalho > > > <luizcmpc@gmail.com> wrote: > > > > > > > > On Wed, Apr 16, 2025 at 6:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Wed, Apr 16, 2025 at 1:52 PM Luiz Carlos Mourão Paes de Carvalho > > > > > <luizcmpc@gmail.com> wrote: > > > > > > > > > > > > Hi Paolo, > > > > > > > > > > > > The dropped ack is a response to data sent by the peer. > > > > > > > > > > > > Peer sends a chunk of data, we ACK with an incorrect SEQ (SND.NXT) that gets dropped > > > > > > by the peer's tcp_sequence function. Connection only advances when we send a RTO. > > > > > > > > > > > > Let me know if the following describes the scenario you expected. I'll add a packetdrill with > > > > > > the expected interaction to the patch if it makes sense. > > > > > > > > > > > > // Tests the invalid SEQs sent by the listener > > > > > > // which are then dropped by the peer. > > > > > > > > > > > > `./common/defaults.sh > > > > > > ./common/set_sysctls.py /proc/sys/net/ipv4/tcp_shrink_window=0` > > > > > > > > > > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > > > > > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > > > > > +0 bind(3, ..., ...) = 0 > > > > > > +0 listen(3, 1) = 0 > > > > > > > > > > > > +0 < S 0:0(0) win 8 <mss 1000,sackOK,nop,nop,nop,wscale 7> > > > > > > +0 > S. 0:0(0) ack 1 <...> > > > > > > +.1 < . 1:1(0) ack 1 win 8 > > > > > > +0 accept(3, ..., ...) = 4 > > > > > > > > > > > > +0 write(4, ..., 990) = 990 > > > > > > +0 > P. 1:991(990) ack 1 > > > > > > +0 < . 1:1(0) ack 991 win 8 // win=8 despite buffer being almost full, shrink_window=0 > > > > > > > > > > > > +0 write(4, ..., 100) = 100 > > > > > > +0 > P. 991:1091(100) ack 1 // SND.NXT=1091 > > > > > > +0 < . 1:1(0) ack 991 win 0 // failed to queue rx data, RCV.NXT=991, RCV.WND=0 > > > > > > > > > > > > +0.1 < P. 1:1001(1000) ack 901 win 0 > > > > > > > > > > This 'ack 901' does not seem right ? > > > > > > > > It's indeed incorrect, the bug still occurs if it were 991. Sorry for that. > > > > > > > > > > > > > > Also your fix would not work if 'win 0' was 'win 1' , and/or if the > > > > > initial wscale was 6 instead of 7 ? > > > > > > > > It indeed does not work if win=1, but that's unlikely to happen unless > > > > you enable shrink_window, and probably > > > > suggests the mentioned loss of precision. > > > > > > > > Now, regarding the scale, it does happen with wscale=6 if your second > > > > write sends < 64 bytes. > > > > This is true with any other scale. Would happen if it were wscale=1 > > > > and the second write sent 2 bytes, etc. > > > > > > > > Happens as far as SND.NXT - (SND.UNA + SND.WND) < 1 << wscale. > > > > > > > > > > > > > > > +0 > . 1091:1091(0) ack 1001 // dropped on tcp_sequence, note that SEQ=1091, while (RCV.NXT + RCV.WND)=991: > > > > > > // if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) > > > > > > // return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; > > > > > > > > > > I assume that your patch would change the 1091:1091(0) to 991:991(0) ? > > > > > > > > Precisely. > > > > > > > > > > > > > > It is not clear if there is a bug here... window reneging is outside > > > > > RFC specs unfortunately, > > > > > as hinted in the tcp_acceptable_seq() comments. > > > > > > > > Yeah, that got me thinking as well, but although it isn't covered by > > > > the RFC, the behavior did change since > > > > 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), > > > > which is a relatively recent patch (Jan 2025). > > > > Currently, the connection could stall indefinitely, which seems > > > > unwanted. I would be happy to search for other > > > > solutions if you have anything come to mind, though. > > > > > > > > The way I see it, the stack shouldn't be sending invalid ACKs that are > > > > known to be incorrect. > > > > > > These are not ACK, but sequence numbers. They were correct when initially sent. > > Yes, I meant invalid ACK packets, with "incorrect" sequence numbers > (more advanced than they should have been for this specific ZeroWindow > scenario). The server has enough knowledge to know what the other peer > expects (the RFC 793 quote in the original message), thus the "known > to be incorrect". I am, however, new to the spec and stack. > > > > > You might try to fix the issue on the other side of the connection, > > the one doing reneging... > > Would that be adjusting tcp_sequence as per RFC 793, page 69? > > If the RCV.WND is zero, no segments will be acceptable, but > special allowance should be made to accept valid ACKs, URGs and > RSTs. > > My initial idea was to change tcp_sequence slightly to pass the test > if RCV.WND is 0. My assessment was that it could go against the > mentioned test (SEG.SEQ = RCV.NXT), and would also require some bigger > changes to tcp_validate_incoming as tcp_rcv_established would still > need to drop the SKB but process the ACK. I'd be happy to give it a > shot anyway. Please do not focus on RWIN 0, but more generally. If a peer sent a ACK @seq WIN (@X << wscale), at some point in the past, then it should accept any SEQ _before_ @seq + (@X << wscale), especially from a pure ACK packet. Reneging (ie decrease RWIN) makes sense as a way to deal with memory stress, but should allow pure acks to be accepted if there sequence is not too far in the future. tcp_receive_window(const struct tcp_sock *tp) ... s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt; ... if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; --> if (after(seq, max_admissible_seq(tp)) return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; Reneging would no retract max_admissible_seq() as far as tcp_sequence() is concerned.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index bc95d2a5924f..4d05083372a0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -88,7 +88,7 @@ static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb) } /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one - * window scaling factor due to loss of precision. + * window scaling factor due to loss of precision, but window is not zero. * If window has been shrunk, what should we make? It is not clear at all. * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-( * Anything in between SND.UNA...SND.UNA+SND.WND also can be already @@ -99,7 +99,7 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk) const struct tcp_sock *tp = tcp_sk(sk); if (!before(tcp_wnd_end(tp), tp->snd_nxt) || - (tp->rx_opt.wscale_ok && + (tp->snd_wnd && tp->rx_opt.wscale_ok && ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale)))) return tp->snd_nxt; else
The current tcp_acceptable_seq() returns SND.NXT when the available window shrinks to less then one scaling factor. This works fine for most cases, and seemed to not be a problem until a slight behavior change to how tcp_select_window() handles ZeroWindow cases. Before commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze"), a zero window would only be announced when data failed to be consumed, and following packets would have non-zero windows despite the receiver still not having any available space. After the commit, however, the zero window is stored in the socket and the advertised window will be zero until the receiver frees up space. For tcp_acceptable_seq(), a zero window case will result in SND.NXT being sent, but the problem now arises when the receptor validates the sequence number in tcp_sequence(): static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp, u32 seq, u32 end_seq) { // ... if (after(seq, tp->rcv_nxt + tcp_receive_window(tp))) return SKB_DROP_REASON_TCP_INVALID_SEQUENCE; // ... } Because RCV.WND is now stored in the socket as zero, using SND.NXT will fail the INVALID_SEQUENCE check: SEG.SEQ <= RCV.NXT + RCV.WND. A valid ACK is dropped by the receiver, correctly, as RFC793 mentions: There are four cases for the acceptability test for an incoming segment: Segment Receive Test Length Window ------- ------- ------------------------------------------- 0 0 SEG.SEQ = RCV.NXT The ACK will be ignored until tcp_write_wakeup() sends SND.UNA again, and the connection continues. If the receptor announces ZeroWindow again, the stall could be very long, as was in my case. Found this out while giving a shot at bug #213827. Could the precision loss lead to a zero window? If so, then this approach probably won't work. Fixes: a4ecb15a2432 ("tcp: accommodate sequence number to a peer's shrunk receive window caused by precision loss in window scaling") Link: https://bugzilla.kernel.org/show_bug.cgi?id=213827 Signed-off-by: Luiz Carvalho <luizcmpc@gmail.com> --- net/ipv4/tcp_output.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)