
[net] tcp: fix excessive TLP and RACK timeouts from HZ rounding

Message ID 20231015174700.2206872-1-ncardwell.sw@gmail.com (mailing list archive)
State Accepted
Commit 1c2709cfff1dedbb9591e989e2f001484208d914
Delegated to: Netdev Maintainers
Series [net] tcp: fix excessive TLP and RACK timeouts from HZ rounding

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2347 this patch: 2347
netdev/cc_maintainers warning 2 maintainers not CCed: dsahern@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 1505 this patch: 1505
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2398 this patch: 2398
netdev/checkpatch warning CHECK: spaces preferred around that '*' (ctx:VxV)
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Neal Cardwell Oct. 15, 2023, 5:47 p.m. UTC
From: Neal Cardwell <ncardwell@google.com>

We discovered from packet traces of slow loss recovery on kernels with
the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
when receiving a SACKed sequence range, the RACK reordering timer was
firing after about 16ms rather than the desired value of roughly
min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
exact same issue.

This commit fixes the TLP transmit timer and RACK reordering timer
floor calculation to more closely match the intended 2ms floor even on
kernels with HZ=250. It does this by adding in a new
TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
instead of the current approach of converting to jiffies and then
adding the TCP_TIMEOUT_MIN value of 2 jiffies.
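
For illustration only (this sketch is not part of the patch), the rounding
difference can be reproduced in a small userspace program. It assumes HZ=250
(one jiffy = 4000 us), approximates usecs_to_jiffies() as a round-up to the
next tick, and uses an example reordering window of 250 us (min_rtt/4 for a
1 ms min_rtt); all of these values are illustrative assumptions:

	#include <stdio.h>

	/* Illustrative assumptions: HZ=250, so one jiffy is 4000 us. */
	#define US_PER_JIFFY		4000u
	#define TCP_TIMEOUT_MIN		2u	/* jiffies, as in include/net/tcp.h */
	#define TCP_TIMEOUT_MIN_US	2000u	/* us, the new floor added by this patch */

	/* Rough stand-in for usecs_to_jiffies(): round up to the next tick. */
	static unsigned int us_to_jiffies(unsigned int us)
	{
		return (us + US_PER_JIFFY - 1) / US_PER_JIFFY;
	}

	int main(void)
	{
		unsigned int window_us = 250;	/* e.g. min_rtt/4 for min_rtt = 1 ms */
		unsigned int old_j, new_j;

		/* Old: convert to jiffies first, then add a 2-jiffy (8 ms) floor. */
		old_j = us_to_jiffies(window_us) + TCP_TIMEOUT_MIN;
		/* New: add a 2000 us floor first, then convert once. */
		new_j = us_to_jiffies(window_us + TCP_TIMEOUT_MIN_US);

		printf("old: %u jiffies (%u ms)\n", old_j, old_j * US_PER_JIFFY / 1000);
		printf("new: %u jiffies (%u ms)\n", new_j, new_j * US_PER_JIFFY / 1000);
		return 0;
	}

The armed values in this sketch (3 jiffies vs 1 jiffy) are lower than the
observed firing times quoted above, since actual expirations also depend on
timer-wheel granularity and on how the reordering window is computed; the
sketch only isolates the extra 8 ms that the old 2-jiffy floor adds at HZ=250.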

Our testing has verified that on kernels with HZ=1000, as expected,
this does not produce significant changes in behavior, but on kernels
with the default HZ=250 the latency improvement can be large. For
example, our tests show that for HZ=250 kernels at low RTTs this fix
roughly halves the latency for the RACK reorder timer: instead of
mostly firing at 16ms it mostly fires at 8ms.
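
And a similarly illustrative check of the HZ=1000 claim (again, not part of
the patch): with one jiffy equal to 1000 us, the new 2000 us floor is an exact
multiple of the tick, so adding it before the round-up conversion yields the
same jiffy count that the old "+ TCP_TIMEOUT_MIN" produced afterwards:

	#include <assert.h>
	#include <stdio.h>

	/* Illustrative assumption: HZ=1000, so one jiffy is 1000 us. */
	#define US_PER_JIFFY 1000u

	static unsigned int us_to_jiffies(unsigned int us)
	{
		return (us + US_PER_JIFFY - 1) / US_PER_JIFFY;	/* round up, as above */
	}

	int main(void)
	{
		/* Old floor: +2 jiffies after converting; new floor: +2000 us before. */
		for (unsigned int x = 1; x < 500000; x++)
			assert(us_to_jiffies(x) + 2 == us_to_jiffies(x + 2000));
		puts("identical floor at HZ=1000");
		return 0;
	}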

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout")
---
 include/net/tcp.h       | 3 +++
 net/ipv4/tcp_output.c   | 9 +++++----
 net/ipv4/tcp_recovery.c | 2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Oct. 16, 2023, 7:57 a.m. UTC | #1
On Sun, Oct 15, 2023 at 7:47 PM Neal Cardwell <ncardwell.sw@gmail.com> wrote:
>
> From: Neal Cardwell <ncardwell@google.com>
>
> We discovered from packet traces of slow loss recovery on kernels with
> the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
> when receiving a SACKed sequence range, the RACK reordering timer was
> firing after about 16ms rather than the desired value of roughly
> min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
> calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
> with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
> exact same issue.
>
> This commit fixes the TLP transmit timer and RACK reordering timer
> floor calculation to more closely match the intended 2ms floor even on
> kernels with HZ=250. It does this by adding in a new
> TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
> instead of the current approach of converting to jiffies and then
> adding the TCP_TIMEOUT_MIN value of 2 jiffies.
>
> Our testing has verified that on kernels with HZ=1000, as expected,
> this does not produce significant changes in behavior, but on kernels
> with the default HZ=250 the latency improvement can be large. For
> example, our tests show that for HZ=250 kernels at low RTTs this fix
> roughly halves the latency for the RACK reorder timer: instead of
> mostly firing at 16ms it mostly fires at 8ms.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout")

Reviewed-by: Eric Dumazet <edumazet@google.com>

It is a bit sad that some distros are still using HZ=250 in 2023.

Thanks Neal !
patchwork-bot+netdevbpf@kernel.org Oct. 18, 2023, 12:40 a.m. UTC | #2
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun, 15 Oct 2023 13:47:00 -0400 you wrote:
> From: Neal Cardwell <ncardwell@google.com>
> 
> We discovered from packet traces of slow loss recovery on kernels with
> the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
> when receiving a SACKed sequence range, the RACK reordering timer was
> firing after about 16ms rather than the desired value of roughly
> min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
> calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
> with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
> exact same issue.
> 
> [...]

Here is the summary with links:
  - [net] tcp: fix excessive TLP and RACK timeouts from HZ rounding
    https://git.kernel.org/netdev/net/c/1c2709cfff1d

You are awesome, thank you!

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7b1a720691aec..4b03ca7cb8a5e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -141,6 +141,9 @@  void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
 #define TCP_TIMEOUT_MIN	(2U) /* Min timeout for TCP timers in jiffies */
+
+#define TCP_TIMEOUT_MIN_US (2*USEC_PER_MSEC) /* Min TCP timeout in microsecs */
+
 #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC6298 2.1 initial RTO value	*/
 #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value, now
 						 * used as a fallback RTO for the
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c8c42c280b76..bbd85672fda70 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2788,7 +2788,7 @@  bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 timeout, rto_delta_us;
+	u32 timeout, timeout_us, rto_delta_us;
 	int early_retrans;
 
 	/* Don't do any loss probe on a Fast Open connection before 3WHS
@@ -2812,11 +2812,12 @@  bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 	 * sample is available then probe after TCP_TIMEOUT_INIT.
 	 */
 	if (tp->srtt_us) {
-		timeout = usecs_to_jiffies(tp->srtt_us >> 2);
+		timeout_us = tp->srtt_us >> 2;
 		if (tp->packets_out == 1)
-			timeout += TCP_RTO_MIN;
+			timeout_us += tcp_rto_min_us(sk);
 		else
-			timeout += TCP_TIMEOUT_MIN;
+			timeout_us += TCP_TIMEOUT_MIN_US;
+		timeout = usecs_to_jiffies(timeout_us);
 	} else {
 		timeout = TCP_TIMEOUT_INIT;
 	}
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index acf4869c5d3b5..bba10110fbbc1 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -104,7 +104,7 @@  bool tcp_rack_mark_lost(struct sock *sk)
 	tp->rack.advanced = 0;
 	tcp_rack_detect_loss(sk, &timeout);
 	if (timeout) {
-		timeout = usecs_to_jiffies(timeout) + TCP_TIMEOUT_MIN;
+		timeout = usecs_to_jiffies(timeout + TCP_TIMEOUT_MIN_US);
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT,
 					  timeout, inet_csk(sk)->icsk_rto);
 	}