[RFC] tcp: fill the one wscale sized window to trigger zero window advertising

Message ID 20250117192859.28252-1-dzq.aishenghu0@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers

Checks

| Context | Check | Description |
|---|---|---|
| netdev/series_format | warning | Single patches do not need cover letters; Target tree name not specified in the subject |
| netdev/tree_selection | success | Guessed tree name to be net-next |
| netdev/ynl | success | Generated files up to date; no warnings/errors; no diff in generated; |
| netdev/fixes_present | success | Fixes tag not required for -next series |
| netdev/header_inline | success | No static functions without inline keyword in header files |
| netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
| netdev/build_tools | success | No tools touched, skip |
| netdev/cc_maintainers | success | CCed 6 of 6 maintainers |
| netdev/build_clang | success | Errors and warnings before: 2 this patch: 2 |
| netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
| netdev/deprecated_api | success | None detected |
| netdev/check_selftest | success | No net selftest shell script |
| netdev/verify_fixes | success | No Fixes tag |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
| netdev/checkpatch | warning | WARNING: line length of 81 exceeds 80 columns |
| netdev/build_clang_rust | success | No Rust files in patch. Skipping build |
| netdev/kdoc | success | Errors and warnings before: 1 this patch: 1 |
| netdev/source_inline | success | Was 0 now: 0 |

Commit Message

Zhongqiu Duan Jan. 17, 2025, 7:28 p.m. UTC
If the rcvbuf of a slow receiver is full, incoming packets are dropped
because tcp_try_rmem_schedule() cannot schedule more memory for them.
The scaled window size is usually not MSS aligned. If the receiver
advertises a window of one wscale unit that lies in (MSS, 2*MSS), and
GSO/TSO is disabled, the sender needs at least two packets to fill it.
But the receiver does not ACK the first one, nor does it offer a zero
window, since we never shrink the offered window.
The sender waits for the ACK because the remaining send window is too
small for another MSS-sized packet, so tcp_snd_wnd_test() returns false.
It then starts the TLP timer and after that the retransmission timer for
the first packet, until it is ACKed.
After the receiver clears the rcvbuf, it may take a long time for
transmission to resume, depending on how many times the packet has
already been retransmitted.
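
The last point can be made concrete with a rough sketch. It assumes plain exponential backoff, where the timeout doubles after each unanswered retransmission up to a cap. The 120 s cap is an assumption based on the usual Linux TCP_RTO_MAX default, and rto_after is a name made up for this illustration.

```sh
# Backed-off RTO in milliseconds after a number of unanswered retransmissions.
# Usage: rto_after BASE_MS BACKOFF [MAX_MS]
rto_after() {
    base_ms=$1
    backoff=$2
    max_ms=${3:-120000}    # assumed cap, not taken from this patch
    rto=$((base_ms << backoff))
    if [ "$rto" -gt "$max_ms" ]; then
        rto=$max_ms
    fi
    echo "$rto"
}

# Print the timeout for each backoff step, starting from a 250 ms base.
for n in 0 1 2 3 4 5 6 7; do
    echo "backoff $n: $(rto_after 250 "$n") ms"
done
```

With a 250 ms base, seven backoffs give 32000 ms, so the next retransmission may be tens of seconds away by the time the receiver has room again.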

This issue should be rare today because GSO/TSO is widely used,
especially after commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always
on") and commit d0d598ca86bd ("net: remove sk_route_forced_caps").
It can be reproduced either (a) by reverting commit 0a6b2a1dc2a2 and
disabling hw GSO/TSO on the NIC with ethtool, or (b) by enabling MD5SIG.

Force a large packet to be split and sent so that it fills the window,
which lets the receiver offer a zero window if it wants to.
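
As a rough userspace model of the window check (not the kernel code), the sketch below uses the sequence numbers from the trace in this message: snd_una 194409, a 2048-byte window and an MSS of 1448. The function names are hypothetical, and sequence-number wraparound is ignored.

```sh
# Does a segment fit into the peer's advertised window?
# Usage: snd_wnd_ok SND_UNA SND_WND SEQ LEN MSS  (exit status 0 means it fits)
snd_wnd_ok() {
    una=$1 wnd=$2 seq=$3 len=$4 mss=$5
    if [ "$len" -gt "$mss" ]; then
        len=$mss    # only the first MSS of a larger skb has to fit
    fi
    [ $((seq + len)) -le $((una + wnd)) ]
}

# Bytes still available in the window at SEQ.
# Usage: wnd_room SND_UNA SND_WND SEQ
wnd_room() {
    echo $(( $1 + $2 - $3 ))
}

check() {
    if snd_wnd_ok 194409 2048 "$1" "$2" 1448; then
        echo "seq $1 len $2: fits"
    else
        echo "seq $1 len $2: blocked"
    fi
}

check 194409 1448                                 # first full-sized segment
check 195857 1448                                 # second full-sized segment
check 195857 "$(wnd_room 194409 2048 195857)"     # split down to the remaining room
```

In this model the first segment fits, the second full-sized one is blocked, and a piece cut down to the remaining 600 bytes fits again, which is the gap the forced split is meant to fill.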

Reproduce:

1. Set net.core.rmem_max to a large value on the RECV side to get a large
   wscale of 11, so that one window unit is 2048 bytes, larger than the
   usual MSS of 1448.
   Optionally, lower net.ipv4.tcp_rmem on the RECV side to trigger the
   problem quickly.

   sysctl net.core.rmem_max=67108864
   sysctl net.ipv4.tcp_rmem="4096 131072 262144"

2. (a) Build a custom kernel with 0a6b2a1dc2a2 reverted and disable
   GSO/TSO on the NIC on the SEND side.
   (b) Or set up an xfrm tunnel with the esp proto and the aead
   rfc4106(gcm(aes)) algo. (Namespaces and veth are fine; the helper
   xfrm.sh is at the end.)

3. Start the receiver but do not read from the socket. (Uses netcat-bsd
   with MD5SIG support.)
   (a) nc -l -p 11235
   (b) nc -l -p 11235 -S

4. Send.
   (a) nc 9.9.6.110 11235 <bigdata
   (b) nc -S 9.9.7.110 11235 <bigdata

5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)

ESTAB 0      0      9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256

6. The sender remains in the retransmission state. (ss -tnpOHoemi)

ESTAB 0      45104  9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048

Tcpdump:

```
51:10.437 S > R: seq 1910971411, win 64240, length 0
51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
51:10.439 S > R: ack 1, win 502, length 0
51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
51:10.440 R > S: ack 2897, win 31, length 0
51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
51:10.440 R > S: ack 4345, win 31, length 0
51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
51:10.440 R > S: ack 7241, win 30, length 0
<...>
51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
51:10.527 R > S: ack 87257, win 2, length 0
51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
51:10.577 R > S: ack 90153, win 1, length 0
51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
51:10.627 R > S: ack 91601, win 1, length 0
<...>
51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
51:14.127 R > S: ack 192961, win 1, length 0
51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
51:14.177 R > S: ack 194409, win 1, length 0
<rcvbuf full>
51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
<clear rcvbuf>
51:51.504 R > S: ack 194409, win 2, length 0
<retransmission timer timeout>
52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
52:20.242 R > S: ack 195857, win 3, length 0
<...>
52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
52:20.245 R > S: ack 223369, win 30, length 0
```
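
The win values in the trace are the raw 16-bit window fields. As a small sketch for reading them, the helper below shifts a raw value by the advertising side's window scale: 11 for R and 7 for S, as shown by wscale in the ss output. The name win_bytes is made up here, and the windows in the two SYN segments are not scaled.

```sh
# Convert the raw window field of a non-SYN segment to bytes.
# Usage: win_bytes RAW_WIN WSCALE
win_bytes() {
    echo $(( $1 << $2 ))
}

# Windows advertised by the receiver R (wscale 11).
for w in 31 30 2 1; do
    echo "R win $w -> $(win_bytes "$w" 11) bytes"
done

# Window advertised by the sender S (wscale 7).
echo "S win 502 -> $(win_bytes 502 7) bytes"
```

A raw "win 1" from R is 2048 bytes and "win 502" from S is 64256 bytes, which match the snd_wnd values reported by ss on the two sides.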

File: xfrm.sh

```
if [ "$1" = "l" ]; then
        mode=tunnel
        daddr=9.9.6.110
        laddr=9.9.6.120
        xdaddr=9.9.7.110
        xladdr=9.9.7.120
        ispi=0x20
        ospi=0x10
        dev=veth0
elif [ "$1" = "r" ]; then
        mode=tunnel
        daddr=9.9.6.120
        laddr=9.9.6.110
        xdaddr=9.9.7.120
        xladdr=9.9.7.110
        ispi=0x10
        ospi=0x20
        dev=veth1
else
        echo "Usage: $0 <l|r>"
        exit 1
fi

ip xfrm state flush
ip xfrm policy flush
ip link set $dev up
ip addr add $laddr/24 dev $dev
ip link add xfrm0 type xfrm dev $dev if_id 3
ip link set xfrm0 up
ip addr add $xladdr/24 dev xfrm0
ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm policy add dir in  tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
```

Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Jan. 17, 2025, 7:50 p.m. UTC | #1
On Fri, Jan 17, 2025 at 8:29 PM Zhongqiu Duan <dzq.aishenghu0@gmail.com> wrote:
>
> If the rcvbuf of a slow receiver is full, the packet will be dropped
> because tcp_try_rmem_schedule() cannot schedule more memory for it.
> Usually the scaled window size is not MSS aligned. If the receiver
> advertised a one wscale sized window is in (MSS, 2*MSS), and GSO/TSO is
> disabled, we need at least two packets to fill it. But the receiver will
> not ACK the first one, and also do not offer a zero window since we never
> shrink the offered window.
> The sender waits for the ACK because the send window is not enough for
> another MSS sized packet, tcp_snd_wnd_test() will return false, and
> starts the TLP and then the retransmission timer for the first packet
> until it is ACKed.
> It may take a long time to resume transmission from retransmission after
> the receiver clears the rcvbuf, depends on the times of retransmissions.
>
> This issue should be rare today as GSO/TSO is a common technology,
> especially after 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") or
> commit d0d598ca86bd ("net: remove sk_route_forced_caps").
> We can reproduce it by reverting commit 0a6b2a1dc2a2 and disabling hw
> GSO/TSO from nic using ethtool (a). Or enabling MD5SIG (b).
>
> Force split a large packet and send it to fill the window so that the
> receiver can offer a zero window if he want.
>
> Reproduce:
>
> 1. Set a large number for net.core.rmem_max on the RECV side to provide
>    a large wscale value of 11, which will provide a 2048 window larger
>    than the normal MSS 1448.
>    Set a slightly lower value for net.ipv4.tcp_rmem on the RECV side to
>    quickly trigger the problem. (optional)
>
>    sysctl net.core.rmem_max=67108864
>    sysctl net.ipv4.tcp_rmem="4096 131072 262144"
>
> 2. (a) Build customized kernel with 0a6b2a1dc2a2 reverted and disabling
>    the GSO/TSO of nic on the SEND side.
>    (b) Or setup the xfrm tunnel with esp proto and aead rfc4106(gcm(aes))
>    algo. (Namespace and veth is okay, helper xfrm.sh is at the end.)
>
> 3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
>    (a) nc -l -p 11235
>    (b) nc -l -p 11235 -S
>
> 4. Send.
>    (a) nc 9.9.6.110 11235 <bigdata
>    (b) nc -S 9.9.7.110 11235 <bigdata
>
> 5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)
>
> ESTAB 0      0      9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256
>
> 6. The sender remains in the retransmission state. (ss -tnpOHoemi)
>
> ESTAB 0      45104  9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048
>
> Tcpdump:
>
> ```
> 51:10.437 S > R: seq 1910971411, win 64240, length 0
> 51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
> 51:10.439 S > R: ack 1, win 502, length 0
> 51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 2897, win 31, length 0
> 51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 4345, win 31, length 0
> 51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 7241, win 30, length 0
> <...>
> 51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
> 51:10.527 R > S: ack 87257, win 2, length 0
> 51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
> 51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
> 51:10.577 R > S: ack 90153, win 1, length 0
> 51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
> 51:10.627 R > S: ack 91601, win 1, length 0
> <...>
> 51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
> 51:14.127 R > S: ack 192961, win 1, length 0
> 51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
> 51:14.177 R > S: ack 194409, win 1, length 0
> <rcvbuf full>

I have not seen a "win 0" though...


> 51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
> <clear rcvbuf>
> 51:51.504 R > S: ack 194409, win 2, length 0
> <retransmission timer timeout>
> 52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 52:20.242 R > S: ack 195857, win 3, length 0
> <...>
> 52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
> 52:20.245 R > S: ack 223369, win 30, length 0
> ```
>
> File: xfrm.sh
>
> ```
> if [ "$1" = "l" ]; then
>         mode=tunnel
>         daddr=9.9.6.110
>         laddr=9.9.6.120
>         xdaddr=9.9.7.110
>         xladdr=9.9.7.120
>         ispi=0x20
>         ospi=0x10
>         dev=veth0
> elif [ "$1" = "r" ]; then
>         mode=tunnel
>         daddr=9.9.6.120
>         laddr=9.9.6.110
>         xdaddr=9.9.7.120
>         xladdr=9.9.7.110
>         ispi=0x10
>         ospi=0x20
>         dev=veth1
> else
>         echo "Usage: $0 <l|r>"
>         exit 1
> fi
>
> ip xfrm state flush
> ip xfrm policy flush
> ip link set $dev up
> ip addr add $laddr/24 dev $dev
> ip link add xfrm0 type xfrm dev $dev if_id 3
> ip link set xfrm0 up
> ip addr add $xladdr/24 dev xfrm0
> ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm policy add dir in  tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
> ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
> ```
>
> Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
> ---
>  net/ipv4/tcp_output.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..61debda90f6d 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
>  {
>         u32 end_seq = TCP_SKB_CB(skb)->end_seq;
>
> +       if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
> +               return true;

This is not generic.

What if tp->snd_wnd == (2 <<  tp->rx_opt.snd_wscale), for wscale == 10 ?

> +
>         if (skb->len > cur_mss)
>                 end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
>
> @@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                 }
>
>                 limit = mss_now;
> -               if (tso_segs > 1 && !tcp_urg_mode(tp))
> +               if (!tcp_urg_mode(tp))
>                         limit = tcp_mss_split_point(sk, skb, mss_now,
>                                                     cwnd_quota,
>                                                     nonagle);
> --
> 2.34.1

I think you are trying to solve the issue on the sender side, in the
fast path, adding lots of cycles.

The issue seems to be a receive-side one instead: failing to send a
"win 0" at the right time/conditions.

If the last ACK had a "win 1", I fail to see why a packet with length
<= 2048 cannot be received.
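
The wscale == 10 case raised above can be checked with a small userspace sketch. It restates the condition the patch adds to tcp_snd_wnd_test() in shell arithmetic. The function names are hypothetical, and 1448 is the MSS from the report.

```sh
# The condition added by the patch: snd_wscale != 0 and snd_wnd == 1 << snd_wscale.
# Usage: patch_cond SND_WSCALE SND_WND  (exit status 0 means the condition holds)
patch_cond() {
    [ "$1" -ne 0 ] && [ $((1 << $1)) -eq "$2" ]
}

# Is the window strictly between one and two MSS?
# Usage: between_one_and_two_mss SND_WND MSS
between_one_and_two_mss() {
    [ "$1" -gt "$2" ] && [ "$1" -lt $((2 * $2)) ]
}

report() {
    if patch_cond "$1" "$2"; then
        echo "wscale $1, snd_wnd $2: covered by the new check"
    else
        echo "wscale $1, snd_wnd $2: not covered by the new check"
    fi
}

report 11 2048    # 1 << 11
report 10 2048    # 2 << 10, the same window in bytes
```

Both windows are 2048 bytes and lie between MSS and 2*MSS, but only the wscale 11 case matches the added condition. Whether the wscale 10 case stalls in practice is not shown by this sketch.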

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..61debda90f6d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2143,6 +2143,9 @@  static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
 {
 	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
+		return true;
+
 	if (skb->len > cur_mss)
 		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
 
@@ -2806,7 +2809,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
+		if (!tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    cwnd_quota,
 						    nonagle);