
[net] tcp: remove 64 KByte limit for initial tp->rcv_wnd value

Message ID: 20240521134220.12510-1-kerneljasonxing@gmail.com
State: Accepted
Commit: 378979e94e953c2070acb4f0e0c98d29260bd09d
Delegated to: Netdev Maintainers
Series: [net] tcp: remove 64 KByte limit for initial tp->rcv_wnd value

Checks

Context                         Check    Description
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for net
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 907 this patch: 907
netdev/build_tools              success  No tools touched, skip
netdev/cc_maintainers           fail     2 blamed authors not CCed: ycheng@google.com soheil@google.com; 2 maintainers not CCed: ycheng@google.com soheil@google.com
netdev/build_clang              success  Errors and warnings before: 911 this patch: 911
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 911 this patch: 911
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 1 this patch: 1
netdev/source_inline            success  Was 0 now: 0
netdev/contest                  success  net-next-2024-05-21--21-00 (tests: 1039)

Commit Message

Jason Xing May 21, 2024, 1:42 p.m. UTC
From: Jason Xing <kernelxing@tencent.com>

Recently, we upgraded some servers to the latest kernel and noticed that
the user-side metrics had become worse than before. The cause is the
limit on tp->rcv_wnd.

In 2018, commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
to around 64KB") limited the initial value of tp->rcv_wnd to 65535. Most
CDN teams do not benefit from that change, because they cannot start with
a large receive window, and transfers are slowed down, especially on
long-RTT paths. A small rcv_wnd means, to some extent, a slow transfer.
This is a side effect for latency- and time-sensitive users.

To avoid future confusion: this change does not affect the initial
receive window on the wire in a SYN or SYN+ACK packet, which stays within
65535 bytes as RFC 7323 requires, because of the limit in
__tcp_transmit_skb():

    th->window      = htons(min(tp->rcv_wnd, 65535U));

In short, __tcp_transmit_skb() already ensures that the constraint is
respected, no matter how large tp->rcv_wnd is. The change does not
violate the RFC.
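
The clamp described above can be modelled in userspace. This is only a
sketch: syn_window_field() is a hypothetical helper, not a kernel
function, and it mirrors just the min(tp->rcv_wnd, 65535U) step quoted
from __tcp_transmit_skb() (the byte-order conversion done by htons() is
left out).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): value that ends up in the 16-bit
 * window field of a SYN or SYN+ACK. The field is not scaled on these
 * segments, so anything above 65535 is clamped.
 */
static uint16_t syn_window_field(uint32_t rcv_wnd)
{
	return (uint16_t)(rcv_wnd < 65535U ? rcv_wnd : 65535U);
}
```

Calling syn_window_field() with a value larger than 65535 returns 65535,
so a larger tp->rcv_wnd cannot leak into the handshake segments.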

Here is one example, with and without the patch:
Before:
client   --- SYN: rwindow=65535 ---> server
client   <--- SYN+ACK: rwindow=65535 ----  server
client   --- ACK: rwindow=65536 ---> server
Note: for the last ACK, the calculation is 512 << 7.

After:
client   --- SYN: rwindow=65535 ---> server
client   <--- SYN+ACK: rwindow=65535 ----  server
client   --- ACK: rwindow=175232 ---> server
Note: I used the following command to raise the initial receive window:
ip route change default via [ip] dev eth0 metric 100 initrwnd 120
For the last ACK, the calculation is 1369 << 7.
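
The arithmetic in both notes can be checked with a small userspace
sketch. The helper names are hypothetical, and two things are
assumptions not stated in the message: an MSS of 1460 bytes, and that the
window is rounded up to a multiple of (1 << wscale) before it is
advertised. Under those assumptions, initrwnd 120 gives 120 * 1460 =
175200 bytes, which rounds up to 1369 << 7.

```c
#include <stdint.h>

/*
 * Window the peer derives from a non-SYN segment: the 16-bit header
 * field shifted left by the window scale agreed in the handshake.
 */
static uint32_t decoded_window(uint16_t field, unsigned int wscale)
{
	return (uint32_t)field << wscale;
}

/* Round a byte count up to the next multiple of (1 << wscale). */
static uint32_t round_up_to_scale(uint32_t bytes, unsigned int wscale)
{
	uint32_t unit = 1U << wscale;

	return (bytes + unit - 1) & ~(unit - 1);
}
```

With wscale = 7, decoded_window(512, 7) is 65536 and
decoded_window(1369, 7) is 175232, matching the two ACKs above.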

With this patch applied, a user who tunes this knob gets a larger
rcv_wnd, which helps transfer data faster and saves some RTTs.

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
v1
Link: https://lore.kernel.org/all/20240518025008.70689-1-kerneljasonxing@gmail.com/
1. refine the changelog (Eric)
2. add fixes tag to make sure the fix is backported (Eric)

RFC v2
Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
1. revise the title and body messages (Neal)
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet May 22, 2024, 6:52 a.m. UTC | #1
On Tue, May 21, 2024 at 3:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> [...]

Reviewed-by: Eric Dumazet <edumazet@google.com>

Neal Cardwell May 22, 2024, 2:08 p.m. UTC | #2
On Wed, May 22, 2024 at 2:52 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, May 21, 2024 at 3:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > [...]
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

Jason, thanks for the patch!

neal

patchwork-bot+netdevbpf@kernel.org May 23, 2024, 10:40 a.m. UTC | #3
Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Tue, 21 May 2024 21:42:20 +0800 you wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Recently, we had some servers upgraded to the latest kernel and noticed
> the indicator from the user side showed worse results than before. It is
> caused by the limitation of tp->rcv_wnd.
> 
> In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> CDN teams would not benefit from this change because they cannot have a
> large window to receive a big packet, which will be slowed down especially
> in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
> the side effect for the latency/time-sensitive users.
> 
> [...]

Here is the summary with links:
  - [net] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
    https://git.kernel.org/netdev/net/c/378979e94e95

You are awesome, thank you!

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95caf8aaa8be..95618d0e78e4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@  void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
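
The effect of the hunk can be illustrated with a simplified userspace
model. This is a sketch, not kernel code: initial_rcv_wnd() and its
pre_patch flag are hypothetical, the
sysctl_tcp_workaround_signed_windows branch is ignored, and space is
taken as an input rather than computed as the kernel does.

```c
#include <stdint.h>

/*
 * Simplified model of the else-branch changed by the hunk, followed by
 * the init_rcv_wnd cap. pre_patch != 0 selects the old behaviour, where
 * space was first clamped to U16_MAX (65535).
 */
static uint32_t initial_rcv_wnd(uint32_t space, uint32_t init_rcv_wnd,
				uint32_t mss, int pre_patch)
{
	uint32_t rcv_wnd = space;

	if (pre_patch && rcv_wnd > 65535U)
		rcv_wnd = 65535U;

	if (init_rcv_wnd) {
		uint32_t cap = init_rcv_wnd * mss;

		if (rcv_wnd > cap)
			rcv_wnd = cap;
	}
	return rcv_wnd;
}
```

For example, with an assumed space of 262144 bytes, initrwnd 120 and an
assumed MSS of 1460, the old code yields 65535 while the patched code
yields 175200, so the initrwnd route attribute takes effect again.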