
[net-next] tcp: introduce rto_max_us sysctl knob

Message ID 20240715033118.32322-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next] tcp: introduce rto_max_us sysctl knob

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 861 this patch: 861
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 2 maintainers not CCed: linux-doc@vger.kernel.org lixiaoyan@google.com
netdev/build_clang success Errors and warnings before: 924 this patch: 924
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5159 this patch: 5159
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-07-15--06-00 (tests: 695)

Commit Message

Jason Xing July 15, 2024, 3:31 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

As we all know, the algorithm of rto is exponential backoff as RFC
defined long time ago. After several rounds of repeatedly transmitting
a lost skb, the expiry of rto timer could reach above 1 second within
the upper bound (120s).

Waiting more than one second to retransmit for some latency-sensitive
application is a little bit unacceptable nowadays, so I decided to
introduce a sysctl knob to allow users to tune it. Still, the maximum
value is 120 seconds.
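
For orientation, here is a minimal userspace sketch (illustrative only, not part of this patch; the constants mirror the kernel's default TCP_RTO_MIN of 200 ms and TCP_RTO_MAX of 120 s) of how the exponential backoff grows until it hits the cap this knob exposes:

#include <stdio.h>

#define RTO_MIN_MS	200		/* kernel default TCP_RTO_MIN (HZ/5) */
#define RTO_MAX_MS	120000		/* kernel default TCP_RTO_MAX (120*HZ) */

int main(void)
{
	unsigned int rto_ms = RTO_MIN_MS;
	int retry;

	for (retry = 0; retry < 15; retry++) {
		printf("retry %2d: rto = %u ms\n", retry, rto_ms);
		rto_ms *= 2;			/* exponential backoff */
		if (rto_ms > RTO_MAX_MS)
			rto_ms = RTO_MAX_MS;	/* the cap this sysctl would lower */
	}
	return 0;
}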

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 Documentation/networking/ip-sysctl.rst | 10 ++++++++++
 include/net/inet_connection_sock.h     | 10 +++++++++-
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  8 ++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 29 insertions(+), 1 deletion(-)

Comments

Eric Dumazet July 15, 2024, 2:40 p.m. UTC | #1
On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> As we all know, the algorithm of rto is exponential backoff as RFC
> defined long time ago.

This is a weak sentence. Please provide RFC numbers instead.

> After several rounds of repeatedly transmitting
> a lost skb, the expiry of rto timer could reach above 1 second within
> the upper bound (120s).

This is confusing. What do you mean exactly ?

>
> Waiting more than one second to retransmit for some latency-sensitive
> application is a little bit unacceptable nowadays, so I decided to
> introduce a sysctl knob to allow users to tune it. Still, the maximum
> value is 120 seconds.

I do not think this sysctl needs usec resolution.

Also storing this sysctl once, and converting it millions of times per
second to jiffies is not good.
I suggest you use proc_dointvec_jiffies() instead in the sysctl handler.
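
For reference, a jiffies-based entry along those lines might look like the following sketch (the field name sysctl_tcp_rto_max is hypothetical; this is not code from the posted patch):

	/* Sketch: store jiffies directly so the fast path never converts.
	 * proc_dointvec_jiffies() translates the user-visible value, given
	 * in seconds, to jiffies when the sysctl is written.
	 */
	{
		.procname	= "tcp_rto_max",
		.data		= &init_net.ipv4.sysctl_tcp_rto_max,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies,
	},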

Minimal value of one jiffies makes also no sense. We can not predict
if some distros/users
might (ab)use this sysctl.

You forgot to update
Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
This means the location you chose for the new sysctl is pretty much
random and not reflecting
this is used in one fast path.

I suggest you wait for net-next being reopened, we are all busy
attending netdev 0x18 conference.
Jason Xing July 16, 2024, 6:41 a.m. UTC | #2
Hello Eric,

On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > As we all know, the algorithm of rto is exponential backoff as RFC
> > defined long time ago.
>
> This is a weak sentence. Please provide RFC numbers instead.

RFC 6298. I will update it.

>
> > After several rounds of repeatedly transmitting
> > a lost skb, the expiry of rto timer could reach above 1 second within
> > the upper bound (120s).
>
> This is confusing. What do you mean exactly ?

I will rewrite this part. What I was trying to say is that waiting
more than 1 second is not very friendly to some applications,
especially the expiry time can reach 120 seconds which is too long.

>
> >
> > Waiting more than one second to retransmit for some latency-sensitive
> > application is a little bit unacceptable nowadays, so I decided to
> > introduce a sysctl knob to allow users to tune it. Still, the maximum
> > value is 120 seconds.
>
> I do not think this sysctl needs usec resolution.

Are you suggesting using jiffies is enough? But I have two reasons:
1) Keep the consistency with rto_min_us
2) If rto_min_us can be set to a very small value, why not rto_max?

What do you think?

>
> Also storing this sysctl once, and converting it millions of times per
> second to jiffies is not good.
> I suggest you use proc_dointvec_jiffies() instead in the sysctl handler.
>
> Minimal value of one jiffies makes also no sense. We can not predict
> if some distros/users
> might (ab)use this sysctl.

Okay. If the final solution is using jiffies, I will accordingly adjust it.

>
> You forgot to update
> Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst

Oh sorry, I forgot.

> This means the location you chose for the new sysctl is pretty much
> random and not reflecting
> this is used in one fast path.

I will investigate its proper location...

>
> I suggest you wait for net-next being reopened, we are all busy
> attending netdev 0x18 conference.

Roger that. Thanks for your suggestions.

Thanks,
Jason
Eric Dumazet July 16, 2024, 6:57 a.m. UTC | #3
On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > As we all know, the algorithm of rto is exponential backoff as RFC
> > > defined long time ago.
> >
> > This is a weak sentence. Please provide RFC numbers instead.
>
> RFC 6298. I will update it.
>
> >
> > > After several rounds of repeatedly transmitting
> > > a lost skb, the expiry of rto timer could reach above 1 second within
> > > the upper bound (120s).
> >
> > This is confusing. What do you mean exactly ?
>
> I will rewrite this part. What I was trying to say is that waiting
> more than 1 second is not very friendly to some applications,
> especially the expiry time can reach 120 seconds which is too long.

Says who ? I think this needs IETF approval.

>
> >
> > >
> > > Waiting more than one second to retransmit for some latency-sensitive
> > > application is a little bit unacceptable nowadays, so I decided to
> > > introduce a sysctl knob to allow users to tune it. Still, the maximum
> > > value is 120 seconds.
> >
> > I do not think this sysctl needs usec resolution.
>
> Are you suggesting using jiffies is enough? But I have two reasons:
> 1) Keep the consistency with rto_min_us
> 2) If rto_min_us can be set to a very small value, why not rto_max?

rto_max is usually 3 orders of magnitude higher than rto_min

For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms.
Some of us use 5 msec rto_min.

jiffies is plain enough for rto_max.
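
A quick back-of-the-envelope sketch of that resolution argument (standalone C, approximating the kernel's round-up conversion; HZ=100 assumed):

#include <stdio.h>

#define HZ	100			/* assumed tick rate */

/* Approximates usecs_to_jiffies(): rounds up to the next whole jiffy. */
static unsigned int usecs_to_jiffies_approx(unsigned long us)
{
	unsigned long us_per_jiffy = 1000000UL / HZ;	/* 10 ms at HZ=100 */

	return (us + us_per_jiffy - 1) / us_per_jiffy;
}

int main(void)
{
	/* A 5 ms rto_min cannot be expressed: it rounds up to 1 jiffy = 10 ms. */
	printf("rto_min 5 ms  -> %u jiffies\n", usecs_to_jiffies_approx(5000));
	/* A 120 s rto_max loses nothing meaningful at 10 ms granularity. */
	printf("rto_max 120 s -> %u jiffies\n", usecs_to_jiffies_approx(120000000UL));
	return 0;
}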


>
> What do you think?

I think you missed many many details really.

Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask
yourself how many things
will break if we allow a sysctl value with 1 second for rto_max.
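
One concrete instance of that coupling, as a simplified model rather than the actual tcp_timer.c code: the time budget that decides when tcp_retries2 gives up assumes the backoff saturates at TCP_RTO_MAX, so capping the timer elsewhere silently shrinks how long a connection is allowed to survive:

#include <stdio.h>

/* Simplified model (not tcp_model_timeout() itself): sum of backed-off
 * RTOs up to the retry boundary, with the RTO saturating at a cap.
 */
static unsigned long total_timeout_ms(unsigned int boundary,
				      unsigned long rto_base_ms,
				      unsigned long rto_cap_ms)
{
	unsigned long total = 0, rto = rto_base_ms;
	unsigned int i;

	for (i = 0; i < boundary; i++) {
		total += rto;
		rto = rto * 2 > rto_cap_ms ? rto_cap_ms : rto * 2;
	}
	return total;
}

int main(void)
{
	/* ~805 s in this model with the defaults (15 retries, 200 ms base, 120 s cap)... */
	printf("cap 120 s: %lu ms\n", total_timeout_ms(15, 200, 120000));
	/* ...but only ~13 s if the cap drops to 1 s and nothing else changes. */
	printf("cap 1 s:   %lu ms\n", total_timeout_ms(15, 200, 1000));
	return 0;
}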

>
> >
> > Also storing this sysctl once, and converting it millions of times per
> > second to jiffies is not good.
> > I suggest you use proc_dointvec_jiffies() instead in the sysctl handler.
> >
> > Minimal value of one jiffies makes also no sense. We can not predict
> > if some distros/users
> > might (ab)use this sysctl.
>
> Okay. If the final solution is using jiffies, I will accordingly adjust it.
>
> >
> > You forgot to update
> > Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>
> Oh sorry, I forgot.
>
> > This means the location you chose for the new sysctl is pretty much
> > random and not reflecting
> > this is used in one fast path.
>
> I will investigate its proper location...
>
> >
> > I suggest you wait for net-next being reopened, we are all busy
> > attending netdev 0x18 conference.
>
> Roger that. Thanks for your suggestions.
>
> Thanks,
> Jason
Jason Xing July 16, 2024, 7:14 a.m. UTC | #4
On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > As we all know, the algorithm of rto is exponential backoff as RFC
> > > > defined long time ago.
> > >
> > > This is a weak sentence. Please provide RFC numbers instead.
> >
> > RFC 6298. I will update it.
> >
> > >
> > > > After several rounds of repeatedly transmitting
> > > > a lost skb, the expiry of rto timer could reach above 1 second within
> > > > the upper bound (120s).
> > >
> > > This is confusing. What do you mean exactly ?
> >
> > I will rewrite this part. What I was trying to say is that waiting
> > more than 1 second is not very friendly to some applications,
> > especially the expiry time can reach 120 seconds which is too long.
>
> Says who ? I think this needs IETF approval.

Did you get me wrong? I mean this rto_max is the same as rto_min_us,
which can be tuned by users.

>
> >
> > >
> > > >
> > > > Waiting more than one second to retransmit for some latency-sensitive
> > > > application is a little bit unacceptable nowadays, so I decided to
> > > > introduce a sysctl knob to allow users to tune it. Still, the maximum
> > > > value is 120 seconds.
> > >
> > > I do not think this sysctl needs usec resolution.
> >
> > Are you suggesting using jiffies is enough? But I have two reasons:
> > 1) Keep the consistency with rto_min_us
> > 2) If rto_min_us can be set to a very small value, why not rto_max?
>
> rto_max is usually 3 orders of magnitude higher than rto_min
>
> For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms.
> Some of us use 5 msec rto_min.
>
> jiffies is plain enough for rto_max.

I got it. Thanks.

>
>
> >
> > What do you think?
>
> I think you missed many many details really.
>
> Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask
> yourself how many things
> will break if we allow a sysctl value with 1 second for rto_max.

I'm not modifying the TCP_RTO_MAX value which is tooooo complicated.
Instead, I'm trying to control the maximum expiry time in the
ICSK_TIME_RETRANS timer. So it's only involved in three cases:
1) syn retrans
2) synack retrans
3) data retrans

Thanks,
Jason
Jason Xing July 16, 2024, 8:11 a.m. UTC | #5
On Tue, Jul 16, 2024 at 3:14 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > Hello Eric,
> > >
> > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > As we all know, the algorithm of rto is exponential backoff as RFC
> > > > > defined long time ago.
> > > >
> > > > This is a weak sentence. Please provide RFC numbers instead.
> > >
> > > RFC 6298. I will update it.
> > >
> > > >
> > > > > After several rounds of repeatedly transmitting
> > > > > a lost skb, the expiry of rto timer could reach above 1 second within
> > > > > the upper bound (120s).
> > > >
> > > > This is confusing. What do you mean exactly ?
> > >
> > > I will rewrite this part. What I was trying to say is that waiting
> > > more than 1 second is not very friendly to some applications,
> > > especially the expiry time can reach 120 seconds which is too long.
> >
> > Says who ? I think this needs IETF approval.
>
> Did you get me wrong? I mean this rto_max is the same as rto_min_us,
> which can be tuned by users.
>
> >
> > >
> > > >
> > > > >
> > > > > Waiting more than one second to retransmit for some latency-sensitive
> > > > > application is a little bit unacceptable nowadays, so I decided to
> > > > > introduce a sysctl knob to allow users to tune it. Still, the maximum
> > > > > value is 120 seconds.
> > > >
> > > > I do not think this sysctl needs usec resolution.
> > >
> > > Are you suggesting using jiffies is enough? But I have two reasons:
> > > 1) Keep the consistency with rto_min_us
> > > 2) If rto_min_us can be set to a very small value, why not rto_max?
> >
> > rto_max is usually 3 orders of magnitude higher than rto_min
> >
> > For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms.
> > Some of us use 5 msec rto_min.
> >
> > jiffies is plain enough for rto_max.
>
> I got it. Thanks.
>
> >
> >
> > >
> > > What do you think?
> >
> > I think you missed many many details really.
> >
> > Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask
> > yourself how many things
> > will break if we allow a sysctl value with 1 second for rto_max.
>
> I'm not modifying the TCP_RTO_MAX value which is tooooo complicated.
> Instead, I'm trying to control the maximum expiry time in the
> ICSK_TIME_RETRANS timer. So it's only involved in three cases:
> 1) syn retrans
> 2) synack retrans
> 3) data retrans

To be clearer, my initial goal is to accelerate the speed of
retransmitting data. There is a simpler way to implement it only in
tcp_retransmit_timer().
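
As a rough sketch of that simpler variant (hypothetical, not posted code): clamp the backed-off RTO directly where tcp_retransmit_timer() doubles it, so only data retransmissions are affected:

	u32 max_rto;

	/* Sketch only: today the doubling is capped by TCP_RTO_MAX. A
	 * per-netns cap could be applied at this one spot instead of in
	 * inet_csk_reset_xmit_timer(); a real version might store the
	 * value in jiffies, per the earlier review comment.
	 */
	max_rto = usecs_to_jiffies(READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rto_max_us));
	icsk->icsk_rto = min(icsk->icsk_rto << 1, max_rto);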
Jason Xing July 16, 2024, 8:26 a.m. UTC | #6
On Tue, Jul 16, 2024 at 4:11 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Jul 16, 2024 at 3:14 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > Hello Eric,
> > > >
> > > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > As we all know, the algorithm of rto is exponential backoff as RFC
> > > > > > defined long time ago.
> > > > >
> > > > > This is a weak sentence. Please provide RFC numbers instead.
> > > >
> > > > RFC 6298. I will update it.
> > > >
> > > > >
> > > > > > After several rounds of repeatedly transmitting
> > > > > > a lost skb, the expiry of rto timer could reach above 1 second within
> > > > > > the upper bound (120s).
> > > > >
> > > > > This is confusing. What do you mean exactly ?
> > > >
> > > > I will rewrite this part. What I was trying to say is that waiting
> > > > more than 1 second is not very friendly to some applications,
> > > > especially the expiry time can reach 120 seconds which is too long.
> > >
> > > Says who ? I think this needs IETF approval.
> >
> > Did you get me wrong? I mean this rto_max is the same as rto_min_us,
> > which can be tuned by users.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Waiting more than one second to retransmit for some latency-sensitive
> > > > > > application is a little bit unacceptable nowadays, so I decided to
> > > > > > introduce a sysctl knob to allow users to tune it. Still, the maximum
> > > > > > value is 120 seconds.
> > > > >
> > > > > I do not think this sysctl needs usec resolution.
> > > >
> > > > Are you suggesting using jiffies is enough? But I have two reasons:
> > > > 1) Keep the consistency with rto_min_us
> > > > 2) If rto_min_us can be set to a very small value, why not rto_max?
> > >
> > > rto_max is usually 3 orders of magnitude higher than rto_min
> > >
> > > For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms.
> > > Some of us use 5 msec rto_min.
> > >
> > > jiffies is plain enough for rto_max.
> >
> > I got it. Thanks.
> >
> > >
> > >
> > > >
> > > > What do you think?
> > >
> > > I think you missed many many details really.
> > >
> > > Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask
> > > yourself how many things
> > > will break if we allow a sysctl value with 1 second for rto_max.
> >
> > I'm not modifying the TCP_RTO_MAX value which is tooooo complicated.
> > Instead, I'm trying to control the maximum expiry time in the
> > ICSK_TIME_RETRANS timer. So it's only involved in three cases:
> > 1) syn retrans
> > 2) synack retrans
> > 3) data retrans
>
> To be clearer, my initial goal is to accelerate the speed of
> retransmitting data. There is a simpler way to implement it only in
> tcp_retransmit_timer().

Sorry to keep replying to this thread so frequently. I have to admit
tuning rto max is much more complicated than rto min. I think simply
reducing the time of retrans is not a feasible way.

Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 3616389c8c2d..32a1907ca95d 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1223,6 +1223,16 @@  tcp_rto_min_us - INTEGER
 
 	Default: 200000
 
+tcp_rto_max_us - INTEGER
+	Maximum TCP retransmission timeout (in microseconds).
+
+	The recommended practice is to use a value less or equal to 120000000
+	microseconds.
+
+	Possible Values: 1 - INT_MAX
+
+	Default: 120000000
+
 UDP variables
 =============
 
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c0deaafebfdc..a0abbafcab9e 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -217,10 +217,18 @@  static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
  */
 static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 					     unsigned long when,
-					     const unsigned long max_when)
+					     unsigned long max_when)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
+	if (what == ICSK_TIME_RETRANS) {
+		int rto_max_us = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rto_max_us);
+		unsigned int rto_max = usecs_to_jiffies(rto_max_us);
+
+		if (rto_max < max_when)
+			max_when = rto_max;
+	}
+
 	if (when > max_when) {
 		pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",
 			 sk, what, when, (void *)_THIS_IP_);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 5fcd61ada622..09a28a5c94d2 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -178,6 +178,7 @@  struct netns_ipv4 {
 	u8 sysctl_tcp_window_scaling;
 	u8 sysctl_tcp_timestamps;
 	int sysctl_tcp_rto_min_us;
+	int sysctl_tcp_rto_max_us;
 	u8 sysctl_tcp_recovery;
 	u8 sysctl_tcp_thin_linear_timeouts;
 	u8 sysctl_tcp_slow_start_after_idle;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9140d20eb2d4..304f173837bc 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1573,6 +1573,14 @@  static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "tcp_rto_max_us",
+		.data		= &init_net.ipv4.sysctl_tcp_rto_max_us,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+	},
 };
 
 static __net_init int ipv4_sysctl_init_net(struct net *net)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd17f25ff288..f06859be5942 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3506,6 +3506,7 @@  static int __net_init tcp_sk_init(struct net *net)
 
 	net->ipv4.sysctl_tcp_pingpong_thresh = 1;
 	net->ipv4.sysctl_tcp_rto_min_us = jiffies_to_usecs(TCP_RTO_MIN);
+	net->ipv4.sysctl_tcp_rto_max_us = jiffies_to_usecs(TCP_RTO_MAX);
 
 	return 0;
 }