| Message ID | 20230509180558.2541885-1-morleyd.kernel@gmail.com (mailing list archive) |
| --- | --- |
| State | Accepted |
| Commit | ccce324dabfe2143519daf50ed8b1ef1d0c542f7 |
| Delegated to | Netdev Maintainers |
| Series | [net-next] tcp: make the first N SYN RTO backoffs linear |
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Tue, 9 May 2023 18:05:58 +0000 you wrote:
> From: David Morley <morleyd@google.com>
>
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
>
> [...]

Here is the summary with links:
  - [net-next] tcp: make the first N SYN RTO backoffs linear
    https://git.kernel.org/netdev/net-next/c/ccce324dabfe

You are awesome, thank you!
On Tue, May 9, 2023 at 8:06 PM David Morley <morleyd.kernel@gmail.com> wrote:
>
> From: David Morley <morleyd@google.com>
>
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
>
> We chose a default value for this sysctl of 4, to follow
> the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
> MacOS and IOS have used this backoff schedule for over
> a decade, since before this 2009 IETF presentation
> discussed the behavior:
> https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
>
> This commit makes the SYN RTO schedule start with a number of
> linear backoffs given by the following sysctl:
> * tcp_syn_linear_timeouts
>
> This changes the SYN RTO scheme to be: init_rto_val for
> tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
>
> For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
> backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
>
> Signed-off-by: David Morley <morleyd@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: David Morley <morleyd@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++---
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_timer.c                   | 17 +++++++++++++----
>  5 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6ec06a33688a..3f6d3d5f5626 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
>  tcp_syn_retries - INTEGER
>          Number of times initial SYNs for an active TCP connection attempt
>          will be retransmitted. Should not be higher than 127. Default value
> -        is 6, which corresponds to 63seconds till the last retransmission
> -        with the current initial RTO of 1second. With this the final timeout
> -        for an active TCP connection attempt will happen after 127seconds.
> +        is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
> +        till the last retransmission with the current initial RTO of 1second.
> +        With this the final timeout for an active TCP connection attempt
> +        will happen after 131seconds.
>
>  tcp_timestamps - INTEGER
>          Enable timestamps as defined in RFC1323.
> @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
>
>          Default: 120
>
> +tcp_syn_linear_timeouts - INTEGER
> +        The number of times for an active TCP connection to retransmit SYNs with
> +        a linear backoff timeout before defaulting to an exponential backoff
> +        timeout. This has no effect on SYNACK at the passive TCP side.
> +
> +        With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
> +        expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
> +        and the first exponential backoff using 2^0 * initial_RTO).
> +        Default: 4
> +
>  tcp_tso_win_divisor - INTEGER
>          This allows control over what percentage of the congestion window
>          can be consumed by a single TSO frame.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index db762e35aca9..a4efb7a2796c 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -194,6 +194,7 @@ struct netns_ipv4 {
>          int sysctl_udp_rmem_min;
>
>          u8 sysctl_fib_notify_on_flag_change;
> +        u8 sysctl_tcp_syn_linear_timeouts;
>
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>          u8 sysctl_udp_l3mdev_accept;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 40fe70fc2015..6ae3345a3bdf 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -34,6 +34,7 @@ static int ip_ttl_min = 1;
>  static int ip_ttl_max = 255;
>  static int tcp_syn_retries_min = 1;
>  static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> +static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
>  static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> @@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = {
>                  .extra1 = SYSCTL_ZERO,
>                  .extra2 = &tcp_plb_max_cong_thresh,
>          },
> +        {
> +                .procname = "tcp_syn_linear_timeouts",
> +                .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
> +                .maxlen = sizeof(u8),
> +                .mode = 0644,
> +                .proc_handler = proc_dou8vec_minmax,
> +                .extra1 = SYSCTL_ZERO,
> +                .extra2 = &tcp_syn_linear_timeouts_max,
> +        },
>          { }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 39bda2b1066e..db24ed8f8509 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net)
>          else
>                  net->ipv4.tcp_congestion_control = &tcp_reno;
>
> +        net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
>          return 0;
>  }
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b839c2f91292..0d93a2573807 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk)
>          struct tcp_sock *tp = tcp_sk(sk);
>          struct net *net = sock_net(sk);
>          bool expired = false, do_reset;
> -        int retry_until;
> +        int retry_until, max_retransmits;
>
>          if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                  if (icsk->icsk_retransmits)
>                          __dst_negative_advice(sk);
>                  retry_until = icsk->icsk_syn_retries ? :
>                          READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
> -                expired = icsk->icsk_retransmits >= retry_until;
> +
> +                max_retransmits = retry_until;
> +                if (sk->sk_state == TCP_SYN_SENT)
> +                        max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
> +
> +                expired = icsk->icsk_retransmits >= max_retransmits;
>          } else {
>                  if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
>                          /* Black hole detection */
> @@ -577,8 +582,12 @@ void tcp_retransmit_timer(struct sock *sk)
>                     icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
>                  icsk->icsk_backoff = 0;
>                  icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
> -        } else {
> -                /* Use normal (exponential) backoff */
> +        } else if (sk->sk_state != TCP_SYN_SENT ||
> +                   icsk->icsk_backoff >
> +                   READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
> +                /* Use normal (exponential) backoff unless linear timeouts are
> +                 * activated.
> +                 */

Hi David, back to this patch.

If sysctl_tcp_syn_linear_timeouts is set to a high value, we could end
up with icsk->icsk_backoff > 64

This can generate various overflows later, eg from
inet_csk_rto_backoff(), called from tcp_ld_RTO_revert()

More generally tcp_ld_RTO_revert() is not aware of linear timeouts.

Thank you.
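For context on the overflow concern: once a backoff counter exceeds 63, computing an RTO as a 64-bit left shift (as in inet_csk_rto_backoff()-style calculations) is undefined behaviour in C. The following standalone sketch is not kernel code and uses made-up values for the RTO and backoff; it only illustrates why icsk_backoff must stay bounded:

/* Standalone illustration, not kernel code: a backoff counter above 63
 * makes a 64-bit left shift undefined behaviour, so the computed
 * timeout can become garbage. All values here are hypothetical. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t rto = 200;         /* hypothetical RTO, e.g. in jiffies */
        unsigned int backoff = 70;  /* reachable if the sysctl allows many linear timeouts */

        if (backoff >= 64) {
                /* rto << backoff would be undefined behaviour on a 64-bit value */
                printf("backoff=%u: shift would overflow a 64-bit timeout\n", backoff);
                return 1;
        }
        printf("when = %llu\n", (unsigned long long)(rto << backoff));
        return 0;
}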
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6ec06a33688a..3f6d3d5f5626 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
 tcp_syn_retries - INTEGER
         Number of times initial SYNs for an active TCP connection attempt
         will be retransmitted. Should not be higher than 127. Default value
-        is 6, which corresponds to 63seconds till the last retransmission
-        with the current initial RTO of 1second. With this the final timeout
-        for an active TCP connection attempt will happen after 127seconds.
+        is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+        till the last retransmission with the current initial RTO of 1second.
+        With this the final timeout for an active TCP connection attempt
+        will happen after 131seconds.
 
 tcp_timestamps - INTEGER
         Enable timestamps as defined in RFC1323.
@@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
 
         Default: 120
 
+tcp_syn_linear_timeouts - INTEGER
+        The number of times for an active TCP connection to retransmit SYNs with
+        a linear backoff timeout before defaulting to an exponential backoff
+        timeout. This has no effect on SYNACK at the passive TCP side.
+
+        With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+        expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+        and the first exponential backoff using 2^0 * initial_RTO).
+        Default: 4
+
 tcp_tso_win_divisor - INTEGER
         This allows control over what percentage of the congestion window
         can be consumed by a single TSO frame.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index db762e35aca9..a4efb7a2796c 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -194,6 +194,7 @@ struct netns_ipv4 {
         int sysctl_udp_rmem_min;
 
         u8 sysctl_fib_notify_on_flag_change;
+        u8 sysctl_tcp_syn_linear_timeouts;
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
         u8 sysctl_udp_l3mdev_accept;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 40fe70fc2015..6ae3345a3bdf 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -34,6 +34,7 @@ static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
 static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
+static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static u32 u32_max_div_HZ = UINT_MAX / HZ;
@@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = {
                 .extra1 = SYSCTL_ZERO,
                 .extra2 = &tcp_plb_max_cong_thresh,
         },
+        {
+                .procname = "tcp_syn_linear_timeouts",
+                .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
+                .maxlen = sizeof(u8),
+                .mode = 0644,
+                .proc_handler = proc_dou8vec_minmax,
+                .extra1 = SYSCTL_ZERO,
+                .extra2 = &tcp_syn_linear_timeouts_max,
+        },
         { }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 39bda2b1066e..db24ed8f8509 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net)
         else
                 net->ipv4.tcp_congestion_control = &tcp_reno;
 
+        net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
         return 0;
 }
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b839c2f91292..0d93a2573807 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk)
         struct tcp_sock *tp = tcp_sk(sk);
         struct net *net = sock_net(sk);
         bool expired = false, do_reset;
-        int retry_until;
+        int retry_until, max_retransmits;
 
         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
                 if (icsk->icsk_retransmits)
                         __dst_negative_advice(sk);
                 retry_until = icsk->icsk_syn_retries ? :
                         READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
-                expired = icsk->icsk_retransmits >= retry_until;
+
+                max_retransmits = retry_until;
+                if (sk->sk_state == TCP_SYN_SENT)
+                        max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
+
+                expired = icsk->icsk_retransmits >= max_retransmits;
         } else {
                 if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
                         /* Black hole detection */
@@ -577,8 +582,12 @@ void tcp_retransmit_timer(struct sock *sk)
                    icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
                 icsk->icsk_backoff = 0;
                 icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
-        } else {
-                /* Use normal (exponential) backoff */
+        } else if (sk->sk_state != TCP_SYN_SENT ||
+                   icsk->icsk_backoff >
+                   READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
+                /* Use normal (exponential) backoff unless linear timeouts are
+                 * activated.
+                 */
                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
         }
         inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
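As a cross-check of the 67seconds / 131seconds figures in the ip-sysctl.rst hunk above, here is a small standalone sketch (not part of the patch) that prints the documented SYN RTO schedule, assuming an initial RTO of 1 second, tcp_syn_retries = 6 and tcp_syn_linear_timeouts = 4:

/* Standalone sketch, not kernel code: prints the SYN retransmission
 * schedule implied by the documentation hunk above. Assumes initial
 * RTO = 1s, tcp_syn_retries = 6, tcp_syn_linear_timeouts = 4. */
#include <stdio.h>

int main(void)
{
        const int syn_retries = 6;  /* net.ipv4.tcp_syn_retries */
        const int linear = 4;       /* net.ipv4.tcp_syn_linear_timeouts */
        int rto = 1;                /* seconds */
        int elapsed = 0;

        /* the original SYN plus (syn_retries + linear) retransmissions */
        for (int attempt = 0; attempt < syn_retries + linear; attempt++) {
                elapsed += rto;
                printf("retransmit %2d after %2ds (elapsed %3ds)\n",
                       attempt + 1, rto, elapsed);
                /* the first `linear` timeouts keep the RTO flat, plus one
                 * more at 2^0 * initial RTO, before doubling begins */
                if (attempt >= linear)
                        rto *= 2;
        }
        printf("last retransmission: %ds, final timeout: %ds\n",
               elapsed, elapsed + rto);
        return 0;
}

With those values the schedule is 1, 1, 1, 1, 1, 2, 4, 8, 16, 32 seconds, i.e. 67 seconds to the last retransmission and 131 seconds to the final timeout. The knob itself would be tuned per network namespace with something like sysctl -w net.ipv4.tcp_syn_linear_timeouts=4 (the default installed by tcp_sk_init() in the patch).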