diff mbox series

[RFC,net-next,2/2] tcp: introduce dynamic initcwnd adjustment

Message ID 20250328151633.30007-3-kerneljasonxing@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series tcp: support initcwnd adjustment

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 518 this patch: 518
netdev/build_tools success Errors and warnings before: 26 (+0) this patch: 26 (+0)
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 966 this patch: 966
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 15128 this patch: 15128
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 55 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 4 this patch: 4
netdev/source_inline success Was 0 now: 0

Commit Message

Jason Xing March 28, 2025, 3:16 p.m. UTC
From: Jason Xing <kernelxing@tencent.com>

More than a decade ago, Google published an important paper[1] describing
how different initcwnd values affect performance. Three years later,
initcwnd was set to 10 by default[2] for common use. Nowadays, however,
more and more small features are developed for particular cases rather
than for all cases.

Some CDN teams raise initcwnd, even above 100, to speed up data
transmission during slow start over the uncontrollable global network.
In the data center we need a similar change to ramp up slow start,
especially when an application occasionally sends a small amount of
data, say 50K at a time, over a persistent connection. Asking users to
tune it with 'ip route' is not that practical because 1) it may affect
unrelated flows, and 2) too big a global value may cause bursts for
all kinds of flows.

This patch adds a dynamic adjustment feature for initcwnd in the slow
start or slow start from idle phase, so that it only accelerates the
first round trip and does not affect the massive data transfer case too
much.
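
For illustration, here is a minimal userspace sketch of how an application
might opt in through the per-socket option added in the diff below. It
assumes a kernel carrying this RFC series; TCP_IW_DYNAMIC (48) is the value
proposed here and is not in released uapi headers, and
enable_dynamic_initcwnd() is a hypothetical helper name. On a kernel without
the series the call is expected to fail, so the helper simply reports the
errno.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Value proposed by this RFC patch; assumption: not in installed headers. */
#ifndef TCP_IW_DYNAMIC
#define TCP_IW_DYNAMIC 48
#endif

/* Hypothetical helper: returns 0 on success or a negative errno. */
int enable_dynamic_initcwnd(int fd)
{
	int one = 1;	/* the patch accepts only 0 or 1 */

	if (setsockopt(fd, IPPROTO_TCP, TCP_IW_DYNAMIC, &one, sizeof(one)) < 0)
		return -errno;
	return 0;
}
```

Since the flag is read in tcp_init_transfer(), it presumably has to be set
before the handshake completes.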

Use 65535 as an upper bound when calculating the proper initcwnd. This
number is derived from the case where an skb carries a 65535 window when
the SYN-ACK is sent in __tcp_transmit_skb(). Without the bound, the
passive open side can see a very big window in the last ACK of the
3-WHS, say 2699776, which would generate an initcwnd of 1912
(2699776 / 1412 with the 1412 mss in my vm). That is too big.
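
The arithmetic above can be checked with a small standalone sketch. It
mirrors the computation in the tcp_init_transfer() hunk in plain userspace
C; the function names and the two macros are hypothetical stand-ins (the
kernel uses TCP_INIT_CWND and tp->max_window / tp->mss_cache), and the 1412
mss is the value from my vm.

```c
#include <assert.h>
#include <stdint.h>

#define INIT_CWND_DEFAULT 10u	/* stands in for TCP_INIT_CWND */
#define MAX_WINDOW_CAP 65535u	/* upper bound proposed in this patch */

/* initcwnd without the cap: window / mss, but at least the default. */
uint32_t uncapped_initcwnd(uint32_t max_window, uint32_t mss)
{
	assert(mss > 0);
	uint32_t cwnd = max_window / mss;

	return cwnd > INIT_CWND_DEFAULT ? cwnd : INIT_CWND_DEFAULT;
}

/* initcwnd with the window clamped to 65535 first, as in the patch. */
uint32_t capped_initcwnd(uint32_t max_window, uint32_t mss)
{
	uint32_t win = max_window < MAX_WINDOW_CAP ? max_window : MAX_WINDOW_CAP;

	return uncapped_initcwnd(win, mss);
}
```

With max_window = 2699776 and mss = 1412, the uncapped version gives 1912
and the capped one gives 46. A small window such as 8192 falls back to the
default of 10.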

This patch helps the small data transfer case. I tested transmitting 50k
at a time and saw the time consumed drop from 1400us to 80us, i.e. about
17.5 times faster (a reduction of roughly 94%).

The idea behind this is that I often see a small data transfer take more
than 2 or 3 RTTs because of the limited snd_cwnd. In the data center, we
can afford the bandwidth if we choose to accelerate transmission.
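
To make the round-trip count concrete, here is a rough sketch under an
idealized slow start model. The assumptions are mine: no loss, no delayed
ACKs, no pacing, the receive window never limits, and cwnd doubles every
round trip. slow_start_rounds() is a hypothetical helper, not kernel code,
and 50K is taken as 51200 bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the round trips needed to send `bytes` when cwnd starts at
 * `initcwnd` segments and doubles each round (idealized slow start).
 */
unsigned int slow_start_rounds(uint32_t bytes, uint32_t mss, uint32_t initcwnd)
{
	assert(mss > 0 && initcwnd > 0);
	uint32_t segs = bytes / mss + (bytes % mss != 0);	/* round up */
	uint32_t cwnd = initcwnd;
	unsigned int rounds = 0;

	while (segs > 0) {
		uint32_t sent = segs < cwnd ? segs : cwnd;

		segs -= sent;
		cwnd *= 2;	/* one cwnd increase per ACKed segment */
		rounds++;
	}
	return rounds;
}
```

Under this model, 51200 bytes with a 1412 mss is 37 segments: an initcwnd
of 10 needs 3 round trips (10 + 20 + 7), while an initcwnd of 46 needs 1.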

Why did I choose tp->max_window / tp->mss_cache? Because cwnd is counted
in MSS-sized packets, and max_window is the signal the other side uses
to tell us the maximum capacity it can bear. As we can see in
tcp_set_skb_tso_segs(), tcp_gso_size is equal to mss.

[1]: https://developers.google.com/speed/protocols/tcp_initcwnd_techreport.pdf
[2]: https://datatracker.ietf.org/doc/html/rfc6928

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
I'm not sure what the upper bound of this window should be. With 65535
as the max window, I get an initcwnd of 46 with the 1412 mss in my vm.
---
 include/linux/tcp.h      |  3 ++-
 include/uapi/linux/tcp.h |  1 +
 net/ipv4/tcp.c           |  8 ++++++++
 net/ipv4/tcp_input.c     | 11 +++++++++--
 4 files changed, 20 insertions(+), 3 deletions(-)

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index aba0a1fe0e36..445db706f3cd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -385,7 +385,8 @@  struct tcp_sock {
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
 		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
-		syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
+		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
+		dynamic_initcwnd:1;  /* dynamic adjustment for initcwnd */
 
 	u8	keepalive_probes; /* num of allowed keep alive probes	*/
 	u32	tcp_tx_delay;	/* delay (in usec) added to TX packets */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index acf77114efed..7c63d0d0b5e1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -143,6 +143,7 @@  enum {
 #define TCP_RTO_MIN_US		45	/* min rto time in us */
 #define TCP_DELACK_MAX_US	46	/* max delayed ack time in us */
 #define TCP_IW			47	/* initial congestion window */
+#define TCP_IW_DYNAMIC         48      /* dynamic adjustment for initcwnd */
 
 #define TCP_REPAIR_ON		1
 #define TCP_REPAIR_OFF		0
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9da7ece57b20..3d419a714f2d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3868,6 +3868,11 @@  int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EINVAL;
 		tp->init_cwnd = val;
 		return 0;
+	case TCP_IW_DYNAMIC:
+		if (val < 0 || val > 1)
+			return -EINVAL;
+		tp->dynamic_initcwnd = val;
+		return 0;
 	}
 
 	sockopt_lock_sock(sk);
@@ -4716,6 +4721,9 @@  int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_IW:
 		val = tp->init_cwnd;
 		break;
+	case TCP_IW_DYNAMIC:
+		val = tp->dynamic_initcwnd;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 00cbe8970a1b..05dbec734aa5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6341,10 +6341,17 @@  void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
 	 * initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
 	 * retransmission has occurred.
 	 */
-	if (tp->total_retrans > 1 && tp->undo_marker)
+	if (tp->total_retrans > 1 && tp->undo_marker) {
 		tcp_snd_cwnd_set(tp, 1);
-	else
+	} else {
+		if (tp->dynamic_initcwnd) {
+			u32 win = min(tp->max_window, 65535);
+
+			tp->init_cwnd = max(win / tp->mss_cache, TCP_INIT_CWND);
+		}
+
 		tcp_snd_cwnd_set(tp, tcp_init_cwnd(tp, __sk_dst_get(sk)));
+	}
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
 	bpf_skops_established(sk, bpf_op, skb);