Message ID | 20231015174700.2206872-1-ncardwell.sw@gmail.com (mailing list archive)
---|---
State | Accepted
Commit | 1c2709cfff1dedbb9591e989e2f001484208d914
Delegated to: | Netdev Maintainers
Series | [net] tcp: fix excessive TLP and RACK timeouts from HZ rounding
On Sun, Oct 15, 2023 at 7:47 PM Neal Cardwell <ncardwell.sw@gmail.com> wrote:
>
> From: Neal Cardwell <ncardwell@google.com>
>
> We discovered from packet traces of slow loss recovery on kernels with
> the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
> when receiving a SACKed sequence range, the RACK reordering timer was
> firing after about 16ms rather than the desired value of roughly
> min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
> calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
> with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
> exact same issue.
>
> This commit fixes the TLP transmit timer and RACK reordering timer
> floor calculation to more closely match the intended 2ms floor even on
> kernels with HZ=250. It does this by adding in a new
> TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
> instead of the current approach of converting to jiffies and then
> adding the TCP_TIMEOUT_MIN value of 2 jiffies.
>
> Our testing has verified that on kernels with HZ=1000, as expected,
> this does not produce significant changes in behavior, but on kernels
> with the default HZ=250 the latency improvement can be large. For
> example, our tests show that for HZ=250 kernels at low RTTs this fix
> roughly halves the latency for the RACK reorder timer: instead of
> mostly firing at 16ms it mostly fires at 8ms.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout")

Reviewed-by: Eric Dumazet <edumazet@google.com>

It is a bit sad that some distros are still using HZ=250 in 2023.

Thanks Neal !
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun, 15 Oct 2023 13:47:00 -0400 you wrote:
> From: Neal Cardwell <ncardwell@google.com>
>
> We discovered from packet traces of slow loss recovery on kernels with
> the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
> when receiving a SACKed sequence range, the RACK reordering timer was
> firing after about 16ms rather than the desired value of roughly
> min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
> calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
> with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
> exact same issue.
>
> [...]

Here is the summary with links:
  - [net] tcp: fix excessive TLP and RACK timeouts from HZ rounding
    https://git.kernel.org/netdev/net/c/1c2709cfff1d

You are awesome, thank you!
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b1a720691aec..4b03ca7cb8a5e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -141,6 +141,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
 #define TCP_TIMEOUT_MIN	(2U) /* Min timeout for TCP timers in jiffies */
+
+#define TCP_TIMEOUT_MIN_US (2*USEC_PER_MSEC) /* Min TCP timeout in microsecs */
+
 #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC6298 2.1 initial RTO value	*/
 #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value, now
 					 * used as a fallback RTO for the
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c8c42c280b76..bbd85672fda70 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2788,7 +2788,7 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 timeout, rto_delta_us;
+	u32 timeout, timeout_us, rto_delta_us;
 	int early_retrans;
 
 	/* Don't do any loss probe on a Fast Open connection before 3WHS
@@ -2812,11 +2812,12 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 	 * sample is available then probe after TCP_TIMEOUT_INIT.
 	 */
 	if (tp->srtt_us) {
-		timeout = usecs_to_jiffies(tp->srtt_us >> 2);
+		timeout_us = tp->srtt_us >> 2;
 		if (tp->packets_out == 1)
-			timeout += TCP_RTO_MIN;
+			timeout_us += tcp_rto_min_us(sk);
 		else
-			timeout += TCP_TIMEOUT_MIN;
+			timeout_us += TCP_TIMEOUT_MIN_US;
+		timeout = usecs_to_jiffies(timeout_us);
 	} else {
 		timeout = TCP_TIMEOUT_INIT;
 	}
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index acf4869c5d3b5..bba10110fbbc1 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -104,7 +104,7 @@ bool tcp_rack_mark_lost(struct sock *sk)
 	tp->rack.advanced = 0;
 	tcp_rack_detect_loss(sk, &timeout);
 	if (timeout) {
-		timeout = usecs_to_jiffies(timeout) + TCP_TIMEOUT_MIN;
+		timeout = usecs_to_jiffies(timeout + TCP_TIMEOUT_MIN_US);
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT,
 					  timeout, inet_csk(sk)->icsk_rto);
 	}