[net-next,v2,1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision

Message ID 20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-1-66aca0eed03e@cloudflare.com (mailing list archive)
State Accepted
Commit 19ce8cd3046587efbd2c6253947be7c22dfccc18
Delegated to: Netdev Maintainers
Series Make TIME-WAIT reuse delay deterministic and configurable

Checks

Context                         Check    Description
netdev/series_format            success  Posting correctly formatted
netdev/tree_selection           success  Clearly marked for net-next
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 13 this patch: 13
netdev/build_tools              success  Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers           warning  2 maintainers not CCed: dsahern@kernel.org horms@kernel.org
netdev/build_clang              success  Errors and warnings before: 26 this patch: 26
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 2057 this patch: 2057
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 48 lines checked
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
netdev/contest                  success  net-next-2024-12-12--00-00 (tests: 795)

Commit Message

Jakub Sitnicki Dec. 9, 2024, 7:38 p.m. UTC
Prepare ground for TIME-WAIT socket reuse with subsecond delay.

Today the last TS.Recent update timestamp, recorded in seconds and stored in
the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two purposes.

Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).

For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.
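
For illustration only (not part of the patch), the ~36 minute figure can be
reproduced with a short calculation. The sketch assumes the relevant interval
is the time a 32-bit timestamp clock needs to advance by 2^31, the point where
a signed 32-bit comparison flips; the helper name ts_wrap_seconds is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit timestamp clock ticking at hz advances by 2^31,
 * i.e. far enough that a signed 32-bit "is newer" comparison flips.
 * Assumption: this half-range is what the wrap-around interval refers to.
 */
static double ts_wrap_seconds(uint32_t hz)
{
	assert(hz > 0);
	return 2147483648.0 / (double)hz;
}
```

With a 1 MHz (microsecond) clock this gives about 2147 seconds, or roughly
35.8 minutes; with a 1 kHz (millisecond) clock it gives about 24.8 days. Both
are far longer than one second, so a second-granularity age is enough for this
purpose.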

Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.

The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).

In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after a shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.

As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually fall in the range (0, 1] seconds, and be 0.5
seconds on average. We delay TW reuse for one full second only when the last
TS.Recent update coincides with our virtual 1 Hz clock tick.
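
A minimal userspace sketch of the rounding effect, for illustration only. It
assumes the seconds stamp is the millisecond time truncated to whole seconds
and that reuse is allowed at the first instant the seconds clock is strictly
past the stamp; old_reuse_delay_ms is a hypothetical name, not a kernel
function.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the second-resolution check: the stamp is stored in whole
 * seconds, and reuse is allowed once the seconds clock is strictly past it.
 * Returns how long, in milliseconds, reuse is actually delayed when the
 * last TS.Recent update happened at update_ms.
 */
static uint32_t old_reuse_delay_ms(uint64_t update_ms)
{
	uint64_t stamp_sec = update_ms / 1000;		/* truncated stamp */
	uint64_t first_ok_ms = (stamp_sec + 1) * 1000;	/* next 1 Hz tick */

	assert(first_ok_ms > update_ms);	/* delay is never zero */
	return (uint32_t)(first_ok_ms - update_ms);
}
```

An update at 5000 ms yields a 1000 ms delay, an update at 5500 ms yields
500 ms, and an update at 5999 ms yields only 1 ms.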

Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside the inet_timewait_sock structure to avoid an additional
memory cost.

Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.
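
The millisecond check can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the kernel code: after32 mirrors
the wrap-safe comparison that time_after32() performs, and tw_reuse_allowed
and tw_reuse_selfcheck are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Userspace stand-in for time_after32(a, b): true if a is after b,
 * comparing modulo 2^32 so that counter wrap-around is handled.
 */
static bool after32(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Reuse is allowed once the millisecond clock is strictly past the
 * TIME-WAIT entry stamp plus one second. The addition may wrap; the
 * modular comparison above still gives the right answer.
 */
static bool tw_reuse_allowed(uint32_t now_ms, uint32_t entry_stamp_ms)
{
	uint32_t reuse_thresh = entry_stamp_ms + MSEC_PER_SEC;

	return after32(now_ms, reuse_thresh);
}

/* Usage example, including a stamp close to the 32-bit wrap point. */
static void tw_reuse_selfcheck(void)
{
	assert(!tw_reuse_allowed(6000u, 5000u));	/* exactly 1 s: not yet */
	assert(tw_reuse_allowed(6001u, 5000u));		/* just past 1 s */
	assert(!tw_reuse_allowed(0xFFFFFFFFu, 0xFFFFFF00u));
	assert(tw_reuse_allowed(745u, 0xFFFFFF00u));	/* threshold wrapped to 744 */
}
```

Unlike the second-resolution check, the wait here does not depend on where
within a second the socket entered TIME-WAIT.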

We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.

A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.

[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_timewait_sock.h | 4 ++++
 net/ipv4/tcp_ipv4.c              | 5 +++--
 net/ipv4/tcp_minisocks.c         | 7 ++++++-
 3 files changed, 13 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Dec. 10, 2024, 8:11 a.m. UTC | #1
Reviewed-by: Eric Dumazet <edumazet@google.com>

Jason Xing Dec. 12, 2024, 1 a.m. UTC | #2
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks for your effort!
Patch

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 62c0a7e65d6bdf4c71a8ea90586b985f9fd30229..67a313575780992a1b55aa26aaa2055111eb7e8d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -74,6 +74,10 @@  struct inet_timewait_sock {
 				tw_tos		: 8;
 	u32			tw_txhash;
 	u32			tw_priority;
+	/**
+	 * @tw_entry_stamp: Time of entry into %TCP_TIME_WAIT state in msec.
+	 */
+	u32			tw_entry_stamp;
 	struct timer_list	tw_timer;
 	struct inet_bind_bucket	*tw_tb;
 	struct inet_bind2_bucket	*tw_tb2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44dbd95fcea08bd81e0ceaa70177ac8a..3b6ba1d16921e079d5ba08c3c0b98dccace8c370 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -120,6 +120,7 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int ts_recent_stamp;
+	u32 reuse_thresh;
 
 	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2)
 		reuse = 0;
@@ -162,9 +163,9 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	   and use initial timestamp retrieved from peer table.
 	 */
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
+	reuse_thresh = READ_ONCE(tw->tw_entry_stamp) + MSEC_PER_SEC;
 	if (ts_recent_stamp &&
-	    (!twp || (reuse && time_after32(ktime_get_seconds(),
-					    ts_recent_stamp)))) {
+	    (!twp || (reuse && time_after32(tcp_clock_ms(), reuse_thresh)))) {
 		/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
 		 * and releasing the bucket lock.
 		 */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 7121d8573928cbf6840b3361b62f4812d365a30b..b089b08e9617862cd73b47ac06b5ac6c1e843ec6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -157,8 +157,11 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 				    rcv_nxt);
 
 		if (tmp_opt.saw_tstamp) {
+			u64 ts = tcp_clock_ms();
+
+			WRITE_ONCE(tw->tw_entry_stamp, ts);
 			WRITE_ONCE(tcptw->tw_ts_recent_stamp,
-				  ktime_get_seconds());
+				   div_u64(ts, MSEC_PER_SEC));
 			WRITE_ONCE(tcptw->tw_ts_recent,
 				   tmp_opt.rcv_tsval);
 		}
@@ -316,6 +319,8 @@  void tcp_time_wait(struct sock *sk, int state, int timeo)
 		tw->tw_mark		= sk->sk_mark;
 		tw->tw_priority		= READ_ONCE(sk->sk_priority);
 		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
+		/* refreshed when we enter true TIME-WAIT state */
+		tw->tw_entry_stamp	= tcp_time_stamp_ms(tp);
 		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
 		tcptw->tw_snd_nxt	= tp->snd_nxt;
 		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);