Message ID | 20240715033118.32322-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] tcp: introduce rto_max_us sysctl knob | expand |
On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > As we all know, the algorithm of rto is exponential backoff as RFC > defined long time ago. This is weak sentence. Please provide RFC numbers instead. > After several rounds of repeatedly transmitting > a lost skb, the expiry of rto timer could reach above 1 second within > the upper bound (120s). This is confusing. What do you mean exactly ? > > Waiting more than one second to retransmit for some latency-sensitive > application is a little bit unacceptable nowadays, so I decided to > introduce a sysctl knob to allow users to tune it. Still, the maximum > value is 120 seconds. I do not think this sysctl needs usec resolution. Also storing this sysctl once, and converting it millions of times per second to jiffies is not good. I suggest you use proc_dointvec_jiffies() instead in the sysctl handler. Minimal value of one jiffies makes also no sense. We can not predict if some distros/users might (ab)use this sysctl. You forgot to update Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst This means the location you chose for the new sysctl is pretty much random and not reflcting this is used in one fast path. I suggest you wait for net-next being reopened, we are all busy attending netdev 0x18 conference.
Hello Eric, On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote: > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > As we all know, the algorithm of rto is exponential backoff as RFC > > defined long time ago. > > This is weak sentence. Please provide RFC numbers instead. RFC 6298. I will update it. > > > After several rounds of repeatedly transmitting > > a lost skb, the expiry of rto timer could reach above 1 second within > > the upper bound (120s). > > This is confusing. What do you mean exactly ? I will rewrite this part. What I was trying to say is that waiting more than 1 second is not very friendly to some applications, especially the expiry time can reach 120 seconds which is too long. > > > > > Waiting more than one second to retransmit for some latency-sensitive > > application is a little bit unacceptable nowadays, so I decided to > > introduce a sysctl knob to allow users to tune it. Still, the maximum > > value is 120 seconds. > > I do not think this sysctl needs usec resolution. Are you suggesting using jiffies is enough? But I have two reasons: 1) Keep the consistency with rto_min_us 2) If rto_min_us can be set to a very small value, why not rto_max? What do you think? > > Also storing this sysctl once, and converting it millions of times per > second to jiffies is not good. > I suggest you use proc_dointvec_jiffies() instead in the sysctl handler. > > Minimal value of one jiffies makes also no sense. We can not predict > if some distros/users > might (ab)use this sysctl. Okay. If the final solution is using jiffies, I will accordingly adjust it. > > You forgot to update > Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst Oh sorry, I forgot. > This means the location you chose for the new sysctl is pretty much > random and not reflcting > this is used in one fast path. I will investigate its proper location... > > I suggest you wait for net-next being reopened, we are all busy > attending netdev 0x18 conference. Roger that. Thanks for your suggestions. Thanks, Jason
On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > Hello Eric, > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > As we all know, the algorithm of rto is exponential backoff as RFC > > > defined long time ago. > > > > This is weak sentence. Please provide RFC numbers instead. > > RFC 6298. I will update it. > > > > > > After several rounds of repeatedly transmitting > > > a lost skb, the expiry of rto timer could reach above 1 second within > > > the upper bound (120s). > > > > This is confusing. What do you mean exactly ? > > I will rewrite this part. What I was trying to say is that waiting > more than 1 second is not very friendly to some applications, > especially the expiry time can reach 120 seconds which is too long. Says who ? I think this needs IETF approval. > > > > > > > > > Waiting more than one second to retransmit for some latency-sensitive > > > application is a little bit unacceptable nowadays, so I decided to > > > introduce a sysctl knob to allow users to tune it. Still, the maximum > > > value is 120 seconds. > > > > I do not think this sysctl needs usec resolution. > > Are you suggesting using jiffies is enough? But I have two reasons: > 1) Keep the consistency with rto_min_us > 2) If rto_min_us can be set to a very small value, why not rto_max? rto_max is usually 3 order of magnitude higher than rto_min For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms. Some of us use 5 msec rto_min. jiffies is plain enough for rto_max. > > What do you think? I think you missed many many details really. Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask yourself how many things will break if we allow a sysctl value with 1 second for rto_max. > > > > > Also storing this sysctl once, and converting it millions of times per > > second to jiffies is not good. > > I suggest you use proc_dointvec_jiffies() instead in the sysctl handler. > > > > Minimal value of one jiffies makes also no sense. We can not predict > > if some distros/users > > might (ab)use this sysctl. > > Okay. If the final solution is using jiffies, I will accordingly adjust it. > > > > > You forgot to update > > Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > > Oh sorry, I forgot. > > > This means the location you chose for the new sysctl is pretty much > > random and not reflcting > > this is used in one fast path. > > I will investigate its proper location... > > > > > I suggest you wait for net-next being reopened, we are all busy > > attending netdev 0x18 conference. > > Roger that. Thanks for your suggestions. > > Thanks, > Jason
On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > Hello Eric, > > > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > As we all know, the algorithm of rto is exponential backoff as RFC > > > > defined long time ago. > > > > > > This is weak sentence. Please provide RFC numbers instead. > > > > RFC 6298. I will update it. > > > > > > > > > After several rounds of repeatedly transmitting > > > > a lost skb, the expiry of rto timer could reach above 1 second within > > > > the upper bound (120s). > > > > > > This is confusing. What do you mean exactly ? > > > > I will rewrite this part. What I was trying to say is that waiting > > more than 1 second is not very friendly to some applications, > > especially the expiry time can reach 120 seconds which is too long. > > Says who ? I think this needs IETF approval. Did you get me wrong? I mean this rto_max is the same as rto_min_us, which can be tuned by users. > > > > > > > > > > > > > > Waiting more than one second to retransmit for some latency-sensitive > > > > application is a little bit unacceptable nowadays, so I decided to > > > > introduce a sysctl knob to allow users to tune it. Still, the maximum > > > > value is 120 seconds. > > > > > > I do not think this sysctl needs usec resolution. > > > > Are you suggesting using jiffies is enough? But I have two reasons: > > 1) Keep the consistency with rto_min_us > > 2) If rto_min_us can be set to a very small value, why not rto_max? > > rto_max is usually 3 order of magnitude higher than rto_min > > For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms. > Some of us use 5 msec rto_min. > > jiffies is plain enough for rto_max. I got it. Thanks. > > > > > > What do you think? > > I think you missed many many details really. > > Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask > yourself how many things > will break if we allow a sysctl value with 1 second for rto_max. I'm not modifying the TCP_RTO_MAX value which is tooooo complicated. Instead, I'm trying to control the maximum expiry time in the ICSK_TIME_RETRANS timer. So it's only involved in three cases: 1) syn retrans 2) synack retrans 3) data retrans Thanks, Jason
On Tue, Jul 16, 2024 at 3:14 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > Hello Eric, > > > > > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > As we all know, the algorithm of rto is exponential backoff as RFC > > > > > defined long time ago. > > > > > > > > This is weak sentence. Please provide RFC numbers instead. > > > > > > RFC 6298. I will update it. > > > > > > > > > > > > After several rounds of repeatedly transmitting > > > > > a lost skb, the expiry of rto timer could reach above 1 second within > > > > > the upper bound (120s). > > > > > > > > This is confusing. What do you mean exactly ? > > > > > > I will rewrite this part. What I was trying to say is that waiting > > > more than 1 second is not very friendly to some applications, > > > especially the expiry time can reach 120 seconds which is too long. > > > > Says who ? I think this needs IETF approval. > > Did you get me wrong? I mean this rto_max is the same as rto_min_us, > which can be tuned by users. > > > > > > > > > > > > > > > > > > > > Waiting more than one second to retransmit for some latency-sensitive > > > > > application is a little bit unacceptable nowadays, so I decided to > > > > > introduce a sysctl knob to allow users to tune it. Still, the maximum > > > > > value is 120 seconds. > > > > > > > > I do not think this sysctl needs usec resolution. > > > > > > Are you suggesting using jiffies is enough? But I have two reasons: > > > 1) Keep the consistency with rto_min_us > > > 2) If rto_min_us can be set to a very small value, why not rto_max? > > > > rto_max is usually 3 order of magnitude higher than rto_min > > > > For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms. > > Some of us use 5 msec rto_min. > > > > jiffies is plain enough for rto_max. > > I got it. Thanks. > > > > > > > > > > > What do you think? > > > > I think you missed many many details really. > > > > Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask > > yourself how many things > > will break if we allow a sysctl value with 1 second for rto_max. > > I'm not modifying the TCP_RTO_MAX value which is tooooo complicated. > Instead, I'm trying to control the maximum expiry time in the > ICSK_TIME_RETRANS timer. So it's only involved in three cases: > 1) syn retrans > 2) synack retrans > 3) data retrans To be clearer, my initial goal is to accelerate the speed of retransmitting data. There is a simpler way to implement it only in tcp_retransmit_timer().
On Tue, Jul 16, 2024 at 4:11 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Tue, Jul 16, 2024 at 3:14 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Tue, Jul 16, 2024 at 2:57 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Mon, Jul 15, 2024 at 11:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > Hello Eric, > > > > > > > > On Mon, Jul 15, 2024 at 10:40 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Sun, Jul 14, 2024 at 8:31 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > As we all know, the algorithm of rto is exponential backoff as RFC > > > > > > defined long time ago. > > > > > > > > > > This is weak sentence. Please provide RFC numbers instead. > > > > > > > > RFC 6298. I will update it. > > > > > > > > > > > > > > > After several rounds of repeatedly transmitting > > > > > > a lost skb, the expiry of rto timer could reach above 1 second within > > > > > > the upper bound (120s). > > > > > > > > > > This is confusing. What do you mean exactly ? > > > > > > > > I will rewrite this part. What I was trying to say is that waiting > > > > more than 1 second is not very friendly to some applications, > > > > especially the expiry time can reach 120 seconds which is too long. > > > > > > Says who ? I think this needs IETF approval. > > > > Did you get me wrong? I mean this rto_max is the same as rto_min_us, > > which can be tuned by users. > > > > > > > > > > > > > > > > > > > > > > > > > > Waiting more than one second to retransmit for some latency-sensitive > > > > > > application is a little bit unacceptable nowadays, so I decided to > > > > > > introduce a sysctl knob to allow users to tune it. Still, the maximum > > > > > > value is 120 seconds. > > > > > > > > > > I do not think this sysctl needs usec resolution. > > > > > > > > Are you suggesting using jiffies is enough? But I have two reasons: > > > > 1) Keep the consistency with rto_min_us > > > > 2) If rto_min_us can be set to a very small value, why not rto_max? > > > > > > rto_max is usually 3 order of magnitude higher than rto_min > > > > > > For HZ=100, using jiffies for rto_min would not allow rto_min < 10 ms. > > > Some of us use 5 msec rto_min. > > > > > > jiffies is plain enough for rto_max. > > > > I got it. Thanks. > > > > > > > > > > > > > > > > What do you think? > > > > > > I think you missed many many details really. > > > > > > Look at all TCP_RTO_MAX instances in net/ipv4/tcp_timer.c and ask > > > yourself how many things > > > will break if we allow a sysctl value with 1 second for rto_max. > > > > I'm not modifying the TCP_RTO_MAX value which is tooooo complicated. > > Instead, I'm trying to control the maximum expiry time in the > > ICSK_TIME_RETRANS timer. So it's only involved in three cases: > > 1) syn retrans > > 2) synack retrans > > 3) data retrans > > To be clearer, my initial goal is to accelerate the speed of > retransmitting data. There is a simpler way to implement it only in > tcp_retransmit_timer(). Sorry to keep replying to this thread so frequently. I have to admit tuning rto max is much more complicated than rto min. I think simply reducing the time of retrans is not a feasible way.
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 3616389c8c2d..32a1907ca95d 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -1223,6 +1223,16 @@ tcp_rto_min_us - INTEGER Default: 200000 +tcp_rto_max_us - INTEGER + Maximum TCP retransmission timeout (in microseconds). + + The recommended practice is to use a value less or equal to 120000000 + microseconds. + + Possible Values: 1 - INT_MAX + + Default: 120000000 + UDP variables ============= diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index c0deaafebfdc..a0abbafcab9e 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -217,10 +217,18 @@ static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what) */ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, unsigned long when, - const unsigned long max_when) + unsigned long max_when) { struct inet_connection_sock *icsk = inet_csk(sk); + if (what == ICSK_TIME_RETRANS) { + int rto_max_us = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rto_max_us); + unsigned int rto_max = usecs_to_jiffies(rto_max_us); + + if (rto_max < max_when) + max_when = rto_max; + } + if (when > max_when) { pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n", sk, what, when, (void *)_THIS_IP_); diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 5fcd61ada622..09a28a5c94d2 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -178,6 +178,7 @@ struct netns_ipv4 { u8 sysctl_tcp_window_scaling; u8 sysctl_tcp_timestamps; int sysctl_tcp_rto_min_us; + int sysctl_tcp_rto_max_us; u8 sysctl_tcp_recovery; u8 sysctl_tcp_thin_linear_timeouts; u8 sysctl_tcp_slow_start_after_idle; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 9140d20eb2d4..304f173837bc 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -1573,6 +1573,14 @@ static struct ctl_table ipv4_net_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE, }, + { + .procname = "tcp_rto_max_us", + .data = &init_net.ipv4.sysctl_tcp_rto_max_us, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ONE, + }, }; static __net_init int ipv4_sysctl_init_net(struct net *net) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fd17f25ff288..f06859be5942 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3506,6 +3506,7 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.sysctl_tcp_pingpong_thresh = 1; net->ipv4.sysctl_tcp_rto_min_us = jiffies_to_usecs(TCP_RTO_MIN); + net->ipv4.sysctl_tcp_rto_max_us = jiffies_to_usecs(TCP_RTO_MAX); return 0; }