diff mbox series

[net-next] tcp: make the first N SYN RTO backoffs linear

Message ID 20230509180558.2541885-1-morleyd.kernel@gmail.com (mailing list archive)
State Accepted
Commit ccce324dabfe2143519daf50ed8b1ef1d0c542f7
Delegated to: Netdev Maintainers
Headers show
Series [net-next] tcp: make the first N SYN RTO backoffs linear | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4399 this patch: 4399
netdev/cc_maintainers warning 6 maintainers not CCed: mubashirq@google.com dsahern@kernel.org kuniyu@amazon.com corbet@lwn.net pabeni@redhat.com linux-doc@vger.kernel.org
netdev/build_clang success Errors and warnings before: 978 this patch: 978
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4621 this patch: 4621
netdev/checkpatch warning WARNING: line length of 95 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

David Morley May 9, 2023, 6:05 p.m. UTC
From: David Morley <morleyd@google.com>

Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++---
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_timer.c                   | 17 +++++++++++++----
 5 files changed, 39 insertions(+), 7 deletions(-)

Comments

patchwork-bot+netdevbpf@kernel.org May 11, 2023, 8:50 a.m. UTC | #1
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Tue,  9 May 2023 18:05:58 +0000 you wrote:
> From: David Morley <morleyd@google.com>
> 
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
> 
> [...]

Here is the summary with links:
  - [net-next] tcp: make the first N SYN RTO backoffs linear
    https://git.kernel.org/netdev/net-next/c/ccce324dabfe

You are awesome, thank you!
Eric Dumazet Nov. 7, 2023, 8:26 p.m. UTC | #2
On Tue, May 9, 2023 at 8:06 PM David Morley <morleyd.kernel@gmail.com> wrote:
>
> From: David Morley <morleyd@google.com>
>
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
>
> We chose a default value for this sysctl of 4, to follow
> the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
> MacOS and IOS have used this backoff schedule for over
> a decade, since before this 2009 IETF presentation
> discussed the behavior:
> https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
>
> This commit makes the SYN RTO schedule start with a number of
> linear backoffs given by the following sysctl:
> * tcp_syn_linear_timeouts
>
> This changes the SYN RTO scheme to be: init_rto_val for
> tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
>
> For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
> backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
>
> Signed-off-by: David Morley <morleyd@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: David Morley <morleyd@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++---
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_timer.c                   | 17 +++++++++++++----
>  5 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6ec06a33688a..3f6d3d5f5626 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
>  tcp_syn_retries - INTEGER
>         Number of times initial SYNs for an active TCP connection attempt
>         will be retransmitted. Should not be higher than 127. Default value
> -       is 6, which corresponds to 63seconds till the last retransmission
> -       with the current initial RTO of 1second. With this the final timeout
> -       for an active TCP connection attempt will happen after 127seconds.
> +       is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
> +       till the last retransmission with the current initial RTO of 1second.
> +       With this the final timeout for an active TCP connection attempt
> +       will happen after 131seconds.
>
>  tcp_timestamps - INTEGER
>         Enable timestamps as defined in RFC1323.
> @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
>
>         Default: 120
>
> +tcp_syn_linear_timeouts - INTEGER
> +       The number of times for an active TCP connection to retransmit SYNs with
> +       a linear backoff timeout before defaulting to an exponential backoff
> +       timeout. This has no effect on SYNACK at the passive TCP side.
> +
> +       With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
> +       expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
> +       and the first exponential backoff using 2^0 * initial_RTO).
> +       Default: 4
> +
>  tcp_tso_win_divisor - INTEGER
>         This allows control over what percentage of the congestion window
>         can be consumed by a single TSO frame.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index db762e35aca9..a4efb7a2796c 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -194,6 +194,7 @@ struct netns_ipv4 {
>         int sysctl_udp_rmem_min;
>
>         u8 sysctl_fib_notify_on_flag_change;
> +       u8 sysctl_tcp_syn_linear_timeouts;
>
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>         u8 sysctl_udp_l3mdev_accept;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 40fe70fc2015..6ae3345a3bdf 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -34,6 +34,7 @@ static int ip_ttl_min = 1;
>  static int ip_ttl_max = 255;
>  static int tcp_syn_retries_min = 1;
>  static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> +static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
>  static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> @@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = SYSCTL_ZERO,
>                 .extra2         = &tcp_plb_max_cong_thresh,
>         },
> +       {
> +               .procname = "tcp_syn_linear_timeouts",
> +               .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
> +               .maxlen = sizeof(u8),
> +               .mode = 0644,
> +               .proc_handler = proc_dou8vec_minmax,
> +               .extra1 = SYSCTL_ZERO,
> +               .extra2 = &tcp_syn_linear_timeouts_max,
> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 39bda2b1066e..db24ed8f8509 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net)
>         else
>                 net->ipv4.tcp_congestion_control = &tcp_reno;
>
> +       net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
>         return 0;
>  }
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b839c2f91292..0d93a2573807 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct net *net = sock_net(sk);
>         bool expired = false, do_reset;
> -       int retry_until;
> +       int retry_until, max_retransmits;
>
>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                 if (icsk->icsk_retransmits)
>                         __dst_negative_advice(sk);
>                 retry_until = icsk->icsk_syn_retries ? :
>                         READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
> -               expired = icsk->icsk_retransmits >= retry_until;
> +
> +               max_retransmits = retry_until;
> +               if (sk->sk_state == TCP_SYN_SENT)
> +                       max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
> +
> +               expired = icsk->icsk_retransmits >= max_retransmits;
>         } else {
>                 if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
>                         /* Black hole detection */
> @@ -577,8 +582,12 @@ void tcp_retransmit_timer(struct sock *sk)
>             icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
>                 icsk->icsk_backoff = 0;
>                 icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
> -       } else {
> -               /* Use normal (exponential) backoff */
> +       } else if (sk->sk_state != TCP_SYN_SENT ||
> +                  icsk->icsk_backoff >
> +                  READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
> +               /* Use normal (exponential) backoff unless linear timeouts are
> +                * activated.
> +                */

Hi David, back to this patch.

If sysctl_tcp_syn_linear_timeouts is set to a high value, we could end up with
 icsk->icsk_backoff > 64

This can generate various overflows later, eg from inet_csk_rto_backoff(),
called from tcp_ld_RTO_revert()

More generally tcp_ld_RTO_revert() is not aware of linear timeouts.

Thank you.
diff mbox series

Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6ec06a33688a..3f6d3d5f5626 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -881,9 +881,10 @@  tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 127. Default value
-	is 6, which corresponds to 63seconds till the last retransmission
-	with the current initial RTO of 1second. With this the final timeout
-	for an active TCP connection attempt will happen after 127seconds.
+	is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+	till the last retransmission with the current initial RTO of 1second.
+	With this the final timeout for an active TCP connection attempt
+	will happen after 131seconds.
 
 tcp_timestamps - INTEGER
 	Enable timestamps as defined in RFC1323.
@@ -946,6 +947,16 @@  tcp_pacing_ca_ratio - INTEGER
 
 	Default: 120
 
+tcp_syn_linear_timeouts - INTEGER
+	The number of times for an active TCP connection to retransmit SYNs with
+	a linear backoff timeout before defaulting to an exponential backoff
+	timeout. This has no effect on SYNACK at the passive TCP side.
+
+	With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+	expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+	and the first exponential backoff using 2^0 * initial_RTO).
+	Default: 4
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index db762e35aca9..a4efb7a2796c 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -194,6 +194,7 @@  struct netns_ipv4 {
 	int sysctl_udp_rmem_min;
 
 	u8 sysctl_fib_notify_on_flag_change;
+	u8 sysctl_tcp_syn_linear_timeouts;
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
 	u8 sysctl_udp_l3mdev_accept;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 40fe70fc2015..6ae3345a3bdf 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -34,6 +34,7 @@  static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
 static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
+static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static u32 u32_max_div_HZ = UINT_MAX / HZ;
@@ -1470,6 +1471,15 @@  static struct ctl_table ipv4_net_table[] = {
 		.extra1         = SYSCTL_ZERO,
 		.extra2         = &tcp_plb_max_cong_thresh,
 	},
+	{
+		.procname = "tcp_syn_linear_timeouts",
+		.data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
+		.maxlen = sizeof(u8),
+		.mode = 0644,
+		.proc_handler = proc_dou8vec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = &tcp_syn_linear_timeouts_max,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 39bda2b1066e..db24ed8f8509 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3275,6 +3275,7 @@  static int __net_init tcp_sk_init(struct net *net)
 	else
 		net->ipv4.tcp_congestion_control = &tcp_reno;
 
+	net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b839c2f91292..0d93a2573807 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -234,14 +234,19 @@  static int tcp_write_timeout(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
 	bool expired = false, do_reset;
-	int retry_until;
+	int retry_until, max_retransmits;
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
 			__dst_negative_advice(sk);
 		retry_until = icsk->icsk_syn_retries ? :
 			READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
-		expired = icsk->icsk_retransmits >= retry_until;
+
+		max_retransmits = retry_until;
+		if (sk->sk_state == TCP_SYN_SENT)
+			max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
+
+		expired = icsk->icsk_retransmits >= max_retransmits;
 	} else {
 		if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
 			/* Black hole detection */
@@ -577,8 +582,12 @@  void tcp_retransmit_timer(struct sock *sk)
 	    icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
 		icsk->icsk_backoff = 0;
 		icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
-	} else {
-		/* Use normal (exponential) backoff */
+	} else if (sk->sk_state != TCP_SYN_SENT ||
+		   icsk->icsk_backoff >
+		   READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
+		/* Use normal (exponential) backoff unless linear timeouts are
+		 * activated.
+		 */
 		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	}
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,