Message ID | 20230920172943.4135513-4-edumazet@google.com (mailing list archive) |
---|---|
State | Accepted |
Commit | bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp: add tcp_delack_max() | expand |
On Wed, Sep 20, 2023 at 1:29 PM Eric Dumazet <edumazet@google.com> wrote: > > While BPF allows to set icsk->->icsk_delack_max nit pick in the commit message: redundant ->->. > and/or icsk->icsk_rto_min, we have an ip route > attribute (RTAX_RTO_MIN) to be able to tune rto_min, > but nothing to consequently adjust max delayed ack, > which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}). > > This makes RTAX_RTO_MIN of almost no practical use, > unless customers are in big trouble. > > Modern days datacenter communications want to set > rto_min to ~5 ms, and the max delayed ack one jiffie > smaller to avoid spurious retransmits. > > After this patch, an "rto_min 5" route attribute will > effectively lower max delayed ack timers to 4 ms. > > Note in the following ss output, "rto:6 ... ato:4" > > $ ss -temoi dst XXXXXX > State Recv-Q Send-Q Local Address:Port Peer Address:Port Process > ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597 > ino:255134 sk:1001 <-> > skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack > cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 > rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 > bytes_received:54823120 segs_out:1370582 segs_in:1370580 > data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps > pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 > busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 > rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536 > > While we could argue this patch fixes a bug with RTAX_RTO_MIN, > I do not add a Fixes: tag, so that we can soak it a bit before > asking backports to stable branches. > > Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> So great that this is now upstream :-) Thank you! > --- > include/net/tcp.h | 2 ++ > net/ipv4/tcp.c | 3 ++- > net/ipv4/tcp_output.c | 16 +++++++++++++++- > 3 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index a8db7d43fb6215197af4a80e270b8c82070d55cb..af9cb37fbe53ec60b4e545e8aa0740bbf30da7b6 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -718,6 +718,8 @@ static inline void tcp_fast_path_check(struct sock *sk) > tcp_fast_path_on(tp); > } > > +u32 tcp_delack_max(const struct sock *sk); > + > /* Compute the actual rto_min value */ > static inline u32 tcp_rto_min(const struct sock *sk) > { > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 69b8d707370844020770438cc4f31aeda4830b3d..e54f91eb943b2f09f303951cc72cbea61ada519d 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -3762,7 +3762,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) > info->tcpi_options |= TCPI_OPT_SYN_DATA; > > info->tcpi_rto = jiffies_to_usecs(icsk->icsk_rto); > - info->tcpi_ato = jiffies_to_usecs(icsk->icsk_ack.ato); > + info->tcpi_ato = jiffies_to_usecs(min(icsk->icsk_ack.ato, > + tcp_delack_max(sk))); > info->tcpi_snd_mss = tp->mss_cache; > info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss; > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..2d1e4b5ac1ca41ff3db8dc58458d4e922a2c4999 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3977,6 +3977,20 @@ int tcp_connect(struct sock *sk) > } > EXPORT_SYMBOL(tcp_connect); > > +u32 tcp_delack_max(const struct sock *sk) > +{ > + const struct dst_entry *dst = __sk_dst_get(sk); > + u32 delack_max = inet_csk(sk)->icsk_delack_max; > + > + if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) { > + u32 rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); > + u32 delack_from_rto_min = max_t(int, 1, rto_min - 1); > + > + delack_max = min_t(u32, delack_max, delack_from_rto_min); > + } > + return delack_max; > +} > + > /* Send out a delayed ack, the caller does the policy checking > * to see if we should even be here. See tcp_input.c:tcp_ack_snd_check() > * for details. > @@ -4012,7 +4026,7 @@ void tcp_send_delayed_ack(struct sock *sk) > ato = min(ato, max_ato); > } > > - ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); > + ato = min_t(u32, ato, tcp_delack_max(sk)); > > /* Stay within the limit we were given */ > timeout = jiffies + ato; > -- > 2.42.0.459.ge4e396fd5e-goog >
On Wed, Sep 20, 2023 at 1:29 PM Eric Dumazet <edumazet@google.com> wrote: > > While BPF allows to set icsk->->icsk_delack_max > and/or icsk->icsk_rto_min, we have an ip route > attribute (RTAX_RTO_MIN) to be able to tune rto_min, > but nothing to consequently adjust max delayed ack, > which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}). > > This makes RTAX_RTO_MIN of almost no practical use, > unless customers are in big trouble. > > Modern days datacenter communications want to set > rto_min to ~5 ms, and the max delayed ack one jiffie > smaller to avoid spurious retransmits. > > After this patch, an "rto_min 5" route attribute will > effectively lower max delayed ack timers to 4 ms. > > Note in the following ss output, "rto:6 ... ato:4" > > $ ss -temoi dst XXXXXX > State Recv-Q Send-Q Local Address:Port Peer Address:Port Process > ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597 > ino:255134 sk:1001 <-> > skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack > cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 > rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 > bytes_received:54823120 segs_out:1370582 segs_in:1370580 > data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps > pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 > busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 > rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536 > > While we could argue this patch fixes a bug with RTAX_RTO_MIN, > I do not add a Fixes: tag, so that we can soak it a bit before > asking backports to stable branches. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- Acked-by: Neal Cardwell <ncardwell@google.com> Thanks, Eric! neal
On 9/20/23 11:29 AM, Eric Dumazet wrote: > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..2d1e4b5ac1ca41ff3db8dc58458d4e922a2c4999 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3977,6 +3977,20 @@ int tcp_connect(struct sock *sk) > } > EXPORT_SYMBOL(tcp_connect); > > +u32 tcp_delack_max(const struct sock *sk) > +{ > + const struct dst_entry *dst = __sk_dst_get(sk); > + u32 delack_max = inet_csk(sk)->icsk_delack_max; > + > + if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) { > + u32 rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); > + u32 delack_from_rto_min = max_t(int, 1, rto_min - 1); `u32` type with max_t type set as `int` > + > + delack_max = min_t(u32, delack_max, delack_from_rto_min); > + } > + return delack_max; > +} > + > /* Send out a delayed ack, the caller does the policy checking > * to see if we should even be here. See tcp_input.c:tcp_ack_snd_check() > * for details. > @@ -4012,7 +4026,7 @@ void tcp_send_delayed_ack(struct sock *sk) > ato = min(ato, max_ato); > } > > - ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); > + ato = min_t(u32, ato, tcp_delack_max(sk)); and then here ato is an `int`. > > /* Stay within the limit we were given */ > timeout = jiffies + ato;
On Wed, Sep 20, 2023 at 11:57 PM David Ahern <dsahern@kernel.org> wrote: > > On 9/20/23 11:29 AM, Eric Dumazet wrote: > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..2d1e4b5ac1ca41ff3db8dc58458d4e922a2c4999 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -3977,6 +3977,20 @@ int tcp_connect(struct sock *sk) > > } > > EXPORT_SYMBOL(tcp_connect); > > > > +u32 tcp_delack_max(const struct sock *sk) > > +{ > > + const struct dst_entry *dst = __sk_dst_get(sk); > > + u32 delack_max = inet_csk(sk)->icsk_delack_max; > > + > > + if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) { > > + u32 rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); > > + u32 delack_from_rto_min = max_t(int, 1, rto_min - 1); > > `u32` type with max_t type set as `int` That is because we allow "rto_min 0" in ip route ... rto_min - 1 is then 0xFFFFFFFF We could argue that "rto_min 0" would be illegal, but this is orthogonal. > > > + > > + delack_max = min_t(u32, delack_max, delack_from_rto_min); > > + } > > + return delack_max; > > +} > > + > > /* Send out a delayed ack, the caller does the policy checking > > * to see if we should even be here. See tcp_input.c:tcp_ack_snd_check() > > * for details. > > @@ -4012,7 +4026,7 @@ void tcp_send_delayed_ack(struct sock *sk) > > ato = min(ato, max_ato); > > } > > > > - ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); > > + ato = min_t(u32, ato, tcp_delack_max(sk)); > > and then here ato is an `int`. > > > > /* Stay within the limit we were given */ > > timeout = jiffies + ato; >
On 9/20/23 8:16 PM, Eric Dumazet wrote: > On Wed, Sep 20, 2023 at 11:57 PM David Ahern <dsahern@kernel.org> wrote: >> >> On 9/20/23 11:29 AM, Eric Dumazet wrote: >>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c >>> index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..2d1e4b5ac1ca41ff3db8dc58458d4e922a2c4999 100644 >>> --- a/net/ipv4/tcp_output.c >>> +++ b/net/ipv4/tcp_output.c >>> @@ -3977,6 +3977,20 @@ int tcp_connect(struct sock *sk) >>> } >>> EXPORT_SYMBOL(tcp_connect); >>> >>> +u32 tcp_delack_max(const struct sock *sk) >>> +{ >>> + const struct dst_entry *dst = __sk_dst_get(sk); >>> + u32 delack_max = inet_csk(sk)->icsk_delack_max; >>> + >>> + if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) { >>> + u32 rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); >>> + u32 delack_from_rto_min = max_t(int, 1, rto_min - 1); >> >> `u32` type with max_t type set as `int` > > That is because we allow "rto_min 0" in ip route ... > > rto_min - 1 is then 0xFFFFFFFF > > We could argue that "rto_min 0" would be illegal, but this is orthogonal. My comment is solely about mismatch on data types. I am surprised use of max_t with mixed data types does not throw a compiler warning.
On Thu, Sep 21, 2023 at 2:37 PM David Ahern <dsahern@kernel.org> wrote: > > My comment is solely about mismatch on data types. I am surprised use of > max_t with mixed data types does not throw a compiler warning. This was intentional. This is max_t() purpose really. Perhaps you are thinking about max() ?
From: Eric Dumazet > Sent: 21 September 2023 13:58 > > On Thu, Sep 21, 2023 at 2:37 PM David Ahern <dsahern@kernel.org> wrote: > > > > > My comment is solely about mismatch on data types. I am surprised use of > > max_t with mixed data types does not throw a compiler warning. > > This was intentional. > > This is max_t() purpose really. Apart from when it gets used to accidentally mask high bits :-) (Although hat is usually consigned to min_t()). Here u32 delack_from_rto_min = max(rto_min, 2u) - 1; would probably be safer (as in have no casts that might have unwanted side effects). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Sep 22, 2023 at 11:59 AM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 21 September 2023 13:58 > > > > On Thu, Sep 21, 2023 at 2:37 PM David Ahern <dsahern@kernel.org> wrote: > > > > > > > > My comment is solely about mismatch on data types. I am surprised use of > > > max_t with mixed data types does not throw a compiler warning. > > > > This was intentional. > > > > This is max_t() purpose really. > > Apart from when it gets used to accidentally mask high bits :-) > (Although hat is usually consigned to min_t()). As explained, this is not an accident, but a conscious decision I made. > > Here > u32 delack_from_rto_min = max(rto_min, 2u) - 1; > would probably be safer (as in have no casts that might have > unwanted side effects). > I find my solution more readable. max(-1, 1) is 1. If this was not the case, many things would be broken in the kernel.
From: Eric Dumazet > Sent: 22 September 2023 11:53 > > On Fri, Sep 22, 2023 at 11:59 AM David Laight <David.Laight@aculab.com> wrote: > > > > From: Eric Dumazet > > > Sent: 21 September 2023 13:58 > > > > > > On Thu, Sep 21, 2023 at 2:37 PM David Ahern <dsahern@kernel.org> wrote: > > > > > > > > > > > My comment is solely about mismatch on data types. I am surprised use of > > > > max_t with mixed data types does not throw a compiler warning. > > > > > > This was intentional. > > > > > > This is max_t() purpose really. > > > > Apart from when it gets used to accidentally mask high bits :-) > > (Although hat is usually consigned to min_t()). > > As explained, this is not an accident, but a conscious decision I made. > > > > > Here > > u32 delack_from_rto_min = max(rto_min, 2u) - 1; > > would probably be safer (as in have no casts that might have > > unwanted side effects). > > > > I find my solution more readable. It has to be said I didn't really like mine either :-) A better alternative would be: max((int)rto_min - 1, 1) to make it absolutely clear what is going on. Far too many of the min_t() and max_t() are just used to silence the over-enthusiastic type checking of min() and max(). So code doing something different might be best making it more obvious. Does 'ip route' stop very large values for rto_min? Unsigned values with the top bit set might cause 'interesting' behaviour! David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
diff --git a/include/net/tcp.h b/include/net/tcp.h index a8db7d43fb6215197af4a80e270b8c82070d55cb..af9cb37fbe53ec60b4e545e8aa0740bbf30da7b6 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -718,6 +718,8 @@ static inline void tcp_fast_path_check(struct sock *sk) tcp_fast_path_on(tp); } +u32 tcp_delack_max(const struct sock *sk); + /* Compute the actual rto_min value */ static inline u32 tcp_rto_min(const struct sock *sk) { diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 69b8d707370844020770438cc4f31aeda4830b3d..e54f91eb943b2f09f303951cc72cbea61ada519d 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3762,7 +3762,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) info->tcpi_options |= TCPI_OPT_SYN_DATA; info->tcpi_rto = jiffies_to_usecs(icsk->icsk_rto); - info->tcpi_ato = jiffies_to_usecs(icsk->icsk_ack.ato); + info->tcpi_ato = jiffies_to_usecs(min(icsk->icsk_ack.ato, + tcp_delack_max(sk))); info->tcpi_snd_mss = tp->mss_cache; info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc..2d1e4b5ac1ca41ff3db8dc58458d4e922a2c4999 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3977,6 +3977,20 @@ int tcp_connect(struct sock *sk) } EXPORT_SYMBOL(tcp_connect); +u32 tcp_delack_max(const struct sock *sk) +{ + const struct dst_entry *dst = __sk_dst_get(sk); + u32 delack_max = inet_csk(sk)->icsk_delack_max; + + if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) { + u32 rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); + u32 delack_from_rto_min = max_t(int, 1, rto_min - 1); + + delack_max = min_t(u32, delack_max, delack_from_rto_min); + } + return delack_max; +} + /* Send out a delayed ack, the caller does the policy checking * to see if we should even be here. See tcp_input.c:tcp_ack_snd_check() * for details. @@ -4012,7 +4026,7 @@ void tcp_send_delayed_ack(struct sock *sk) ato = min(ato, max_ato); } - ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); + ato = min_t(u32, ato, tcp_delack_max(sk)); /* Stay within the limit we were given */ timeout = jiffies + ato;
While BPF allows to set icsk->->icsk_delack_max and/or icsk->icsk_rto_min, we have an ip route attribute (RTAX_RTO_MIN) to be able to tune rto_min, but nothing to consequently adjust max delayed ack, which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}). This makes RTAX_RTO_MIN of almost no practical use, unless customers are in big trouble. Modern days datacenter communications want to set rto_min to ~5 ms, and the max delayed ack one jiffie smaller to avoid spurious retransmits. After this patch, an "rto_min 5" route attribute will effectively lower max delayed ack timers to 4 ms. Note in the following ss output, "rto:6 ... ato:4" $ ss -temoi dst XXXXXX State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597 ino:255134 sk:1001 <-> skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 bytes_received:54823120 segs_out:1370582 segs_in:1370580 data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536 While we could argue this patch fixes a bug with RTAX_RTO_MIN, I do not add a Fixes: tag, so that we can soak it a bit before asking backports to stable branches. Signed-off-by: Eric Dumazet <edumazet@google.com> --- include/net/tcp.h | 2 ++ net/ipv4/tcp.c | 3 ++- net/ipv4/tcp_output.c | 16 +++++++++++++++- 3 files changed, 19 insertions(+), 2 deletions(-)