Message ID | 20250117192859.28252-1-dzq.aishenghu0@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | [RFC] tcp: fill the one wscale sized window to trigger zero window advertising |
On Fri, Jan 17, 2025 at 8:29 PM Zhongqiu Duan <dzq.aishenghu0@gmail.com> wrote:
>
> If the rcvbuf of a slow receiver is full, the packet will be dropped
> because tcp_try_rmem_schedule() cannot schedule more memory for it.
> Usually the scaled window size is not MSS aligned. If the receiver
> advertised a one wscale sized window is in (MSS, 2*MSS), and GSO/TSO is
> disabled, we need at least two packets to fill it. But the receiver will
> not ACK the first one, and also do not offer a zero window since we never
> shrink the offered window.
> The sender waits for the ACK because the send window is not enough for
> another MSS sized packet, tcp_snd_wnd_test() will return false, and
> starts the TLP and then the retransmission timer for the first packet
> until it is ACKed.
> It may take a long time to resume transmission from retransmission after
> the receiver clears the rcvbuf, depends on the times of retransmissions.
>
> This issue should be rare today as GSO/TSO is a common technology,
> especially after 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") or
> commit d0d598ca86bd ("net: remove sk_route_forced_caps").
> We can reproduce it by reverting commit 0a6b2a1dc2a2 and disabling hw
> GSO/TSO from nic using ethtool (a). Or enabling MD5SIG (b).
>
> Force split a large packet and send it to fill the window so that the
> receiver can offer a zero window if he want.
>
> Reproduce:
>
> 1. Set a large number for net.core.rmem_max on the RECV side to provide
>    a large wscale value of 11, which will provide a 2048 window larger
>    than the normal MSS 1448.
>    Set a slightly lower value for net.ipv4.tcp_rmem on the RECV side to
>    quickly trigger the problem. (optional)
>
>    sysctl net.core.rmem_max=67108864
>    sysctl net.ipv4.tcp_rmem="4096 131072 262144"
>
> 2. (a) Build customized kernel with 0a6b2a1dc2a2 reverted and disabling
>    the GSO/TSO of nic on the SEND side.
>    (b) Or setup the xfrm tunnel with esp proto and aead rfc4106(gcm(aes))
>    algo. (Namespace and veth is okay, helper xfrm.sh is at the end.)
>
> 3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
>    (a) nc -l -p 11235
>    (b) nc -l -p 11235 -S
>
> 4. Send.
>    (a) nc 9.9.6.110 11235 <bigdata
>    (b) nc -S 9.9.7.110 11235 <bigdata
>
> 5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)
>
> ESTAB 0 0 9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256
>
> 6. The sender remains in the retransmission state. (ss -tnpOHoemi)
>
> ESTAB 0 45104 9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048
>
> Tcpdump:
>
> ```
> 51:10.437 S > R: seq 1910971411, win 64240, length 0
> 51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
> 51:10.439 S > R: ack 1, win 502, length 0
> 51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 2897, win 31, length 0
> 51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 4345, win 31, length 0
> 51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 7241, win 30, length 0
> <...>
> 51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
> 51:10.527 R > S: ack 87257, win 2, length 0
> 51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
> 51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
> 51:10.577 R > S: ack 90153, win 1, length 0
> 51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
> 51:10.627 R > S: ack 91601, win 1, length 0
> <...>
> 51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
> 51:14.127 R > S: ack 192961, win 1, length 0
> 51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
> 51:14.177 R > S: ack 194409, win 1, length 0
> <rcvbuf full>

I have not seen a "win 0" though...

> 51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
> <clear rcvbuf>
> 51:51.504 R > S: ack 194409, win 2, length 0
> <retransmission timer timeout>
> 52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 52:20.242 R > S: ack 195857, win 3, length 0
> <...>
> 52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
> 52:20.245 R > S: ack 223369, win 30, length 0
> ```
>
> File: xfrm.sh
>
> ```
> if [ "$1" = "l" ]; then
> 	mode=tunnel
> 	daddr=9.9.6.110
> 	laddr=9.9.6.120
> 	xdaddr=9.9.7.110
> 	xladdr=9.9.7.120
> 	ispi=0x20
> 	ospi=0x10
> 	dev=veth0
> elif [ "$1" = "r" ]; then
> 	mode=tunnel
> 	daddr=9.9.6.120
> 	laddr=9.9.6.110
> 	xdaddr=9.9.7.120
> 	xladdr=9.9.7.110
> 	ispi=0x10
> 	ospi=0x20
> 	dev=veth1
> else
> 	echo "Usage: $0 <l|r>"
> 	exit 1
> fi
>
> ip xfrm state flush
> ip xfrm policy flush
> ip link set $dev up
> ip addr add $laddr/24 dev $dev
> ip link add xfrm0 type xfrm dev $dev if_id 3
> ip link set xfrm0 up
> ip addr add $xladdr/24 dev xfrm0
> ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm policy add dir in tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
> ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
> ```
>
> Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
> ---
>  net/ipv4/tcp_output.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..61debda90f6d 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
>  {
>  	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
>
> +	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
> +		return true;

This is not generic.

What if tp->snd_wnd == (2 << tp->rx_opt.snd_wscale), for wscale == 10 ?

> +
>  	if (skb->len > cur_mss)
>  		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
>
> @@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		}
>
>  		limit = mss_now;
> -		if (tso_segs > 1 && !tcp_urg_mode(tp))
> +		if (!tcp_urg_mode(tp))
>  			limit = tcp_mss_split_point(sk, skb, mss_now,
>  						    cwnd_quota,
>  						    nonagle);
> --
> 2.34.1

I think you are trying to solve the issue at the sender side, in the
fast path, adding lots of cycles.

While the issue seems to be a receive side one, failing to send a "win 0"
at the right time/conditions.

If the last ACK had a "win 1", I fail to see why a packet with
length <= 2048 can not be received.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..61debda90f6d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
 {
 	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
+		return true;
+
 	if (skb->len > cur_mss)
 		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
 
@@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
+		if (!tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    cwnd_quota,
 						    nonagle);
If the rcvbuf of a slow receiver is full, the packet will be dropped
because tcp_try_rmem_schedule() cannot schedule more memory for it.
Usually the scaled window size is not MSS aligned. If the receiver
advertises a window of one wscale unit that falls in (MSS, 2*MSS), and
GSO/TSO is disabled, we need at least two packets to fill it. But the
receiver will not ACK the first one, and also does not offer a zero
window since we never shrink the offered window.
The sender waits for the ACK because the send window is not large enough
for another MSS sized packet: tcp_snd_wnd_test() returns false, and the
sender starts the TLP and then the retransmission timer for the first
packet until it is ACKed.
It may take a long time to resume transmission after the receiver clears
the rcvbuf, depending on how many retransmissions have already occurred.

This issue should be rare today as GSO/TSO is a common technology,
especially after commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on")
or commit d0d598ca86bd ("net: remove sk_route_forced_caps").
We can reproduce it by reverting commit 0a6b2a1dc2a2 and disabling hw
GSO/TSO on the NIC using ethtool (a), or by enabling MD5SIG (b).

Force split a large packet and send it to fill the window so that the
receiver can offer a zero window if it wants.

Reproduce:

1. Set a large number for net.core.rmem_max on the RECV side to get a
   large wscale value of 11, which gives a one-unit window of 2048 bytes,
   larger than the normal MSS of 1448.
   Set a slightly lower value for net.ipv4.tcp_rmem on the RECV side to
   quickly trigger the problem. (optional)

   sysctl net.core.rmem_max=67108864
   sysctl net.ipv4.tcp_rmem="4096 131072 262144"

2. (a) Build a customized kernel with 0a6b2a1dc2a2 reverted and disable
   the GSO/TSO of the NIC on the SEND side.
   (b) Or set up the xfrm tunnel with esp proto and aead rfc4106(gcm(aes))
   algo. (Namespaces and veth are okay; the helper xfrm.sh is at the end.)

3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
   (a) nc -l -p 11235
   (b) nc -l -p 11235 -S

4. Send.
   (a) nc 9.9.6.110 11235 <bigdata
   (b) nc -S 9.9.7.110 11235 <bigdata

5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)

ESTAB 0 0 9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256

6. The sender remains in the retransmission state. (ss -tnpOHoemi)

ESTAB 0 45104 9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048

Tcpdump:

```
51:10.437 S > R: seq 1910971411, win 64240, length 0
51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
51:10.439 S > R: ack 1, win 502, length 0
51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
51:10.440 R > S: ack 2897, win 31, length 0
51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
51:10.440 R > S: ack 4345, win 31, length 0
51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
51:10.440 R > S: ack 7241, win 30, length 0
<...>
51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
51:10.527 R > S: ack 87257, win 2, length 0
51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
51:10.577 R > S: ack 90153, win 1, length 0
51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
51:10.627 R > S: ack 91601, win 1, length 0
<...>
51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
51:14.127 R > S: ack 192961, win 1, length 0
51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
51:14.177 R > S: ack 194409, win 1, length 0
<rcvbuf full>
51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
<clear rcvbuf>
51:51.504 R > S: ack 194409, win 2, length 0
<retransmission timer timeout>
52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
52:20.242 R > S: ack 195857, win 3, length 0
<...>
52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
52:20.245 R > S: ack 223369, win 30, length 0
```

File: xfrm.sh

```
if [ "$1" = "l" ]; then
	mode=tunnel
	daddr=9.9.6.110
	laddr=9.9.6.120
	xdaddr=9.9.7.110
	xladdr=9.9.7.120
	ispi=0x20
	ospi=0x10
	dev=veth0
elif [ "$1" = "r" ]; then
	mode=tunnel
	daddr=9.9.6.120
	laddr=9.9.6.110
	xdaddr=9.9.7.120
	xladdr=9.9.7.110
	ispi=0x10
	ospi=0x20
	dev=veth1
else
	echo "Usage: $0 <l|r>"
	exit 1
fi

ip xfrm state flush
ip xfrm policy flush
ip link set $dev up
ip addr add $laddr/24 dev $dev
ip link add xfrm0 type xfrm dev $dev if_id 3
ip link set xfrm0 up
ip addr add $xladdr/24 dev xfrm0
ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm policy add dir in tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
```

Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)