
[RFC,net-next,v2,1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision

Message ID 20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-1-b0a335247304@cloudflare.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Make TIME-WAIT reuse delay deterministic and configurable

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 14 this patch: 14
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers warning 4 maintainers not CCed: kuba@kernel.org dsahern@kernel.org horms@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 27 this patch: 27
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2058 this patch: 2058
netdev/checkpatch warning WARNING: line length of 89 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Sitnicki Nov. 13, 2024, 10:06 a.m. UTC
Prepare ground for TIME-WAIT socket reuse with subsecond delay.

Today the timestamp of the last TS.Recent update, recorded in seconds and
stored in the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has
two purposes.

Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).

For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.
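
The ~36 minute figure can be checked with a short calculation. The sketch
below assumes that timestamps are compared as 32-bit values with wrap-safe
signed arithmetic, so the usable window is 2^31 ticks. The helper name is
hypothetical and is not a kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: whole seconds until a 32-bit
 * timestamp clock ticking at tick_hz has advanced by 2^31 ticks, which is
 * the point where a wrap-safe signed comparison of two timestamps flips.
 */
static uint64_t paws_half_wrap_secs(uint64_t tick_hz)
{
	assert(tick_hz > 0);
	return (UINT64_C(1) << 31) / tick_hz;
}
```

For a microsecond clock, paws_half_wrap_secs(1000000) gives 2147 seconds,
which is about 35.8 minutes. For a millisecond clock the same calculation
gives roughly 24.8 days.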

Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.

The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).

In this case, using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after a shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.

As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay actually falls in the range (0, 1] seconds, 0.5 seconds on
average. We delay TW reuse for one full second only when the last TS.Recent
update coincides with our virtual 1 Hz clock tick.
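
The rounding effect can be illustrated with a small userspace model. This
is only a sketch: after32() mirrors the kernel's time_after32() macro, while
reuse_ok_sec(), reuse_ok_ms() and min_reuse_delay_ms() are hypothetical
names made up for this illustration. The seconds-based model truncates a
millisecond clock and ignores counter wrap-around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Wrap-safe "a is after b" for 32-bit counters, modelled on the kernel's
 * time_after32(): true when (b - a), taken as signed, is negative.
 */
static bool after32(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Model of the old check: both times truncated to whole seconds. */
static bool reuse_ok_sec(uint32_t now_ms, uint32_t stamp_ms)
{
	return after32(now_ms / MSEC_PER_SEC, stamp_ms / MSEC_PER_SEC);
}

/* Model of the new check: millisecond stamp plus a one second delay. */
static bool reuse_ok_ms(uint32_t now_ms, uint32_t stamp_ms)
{
	return after32(now_ms, stamp_ms + MSEC_PER_SEC);
}

/* Smallest delay in ms after which check() first allows reuse. */
static uint32_t min_reuse_delay_ms(bool (*check)(uint32_t, uint32_t),
				   uint32_t stamp_ms)
{
	uint32_t d;

	for (d = 0; d <= 2 * MSEC_PER_SEC; d++)
		if (check(stamp_ms + d, stamp_ms))
			return d;
	assert(0 && "reuse not allowed within two seconds");
	return d;
}
```

In this model a stamp taken at 5999 ms lets the seconds-based check pass
after just 1 ms, while a stamp taken at 5000 ms has to wait 1000 ms. The
millisecond check first passes 1001 ms after the stamp in both cases.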

We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.

Considering the above, change the resolution of the TS.Recent update
timestamp stored in the TW socket (tw_ts_recent_stamp) to milliseconds to
(i) reliably delay TIME-WAIT reuse by one second, and (ii) prepare for a
configurable subsecond reuse delay in a subsequent change.

Limit the resolution change to just the true TIME-WAIT state, that is, when
the TW socket is in the TCP_TIME_WAIT substate. This approach offers a
tradeoff between the added complexity of converting between time units and
the risk of touching both the TIME-WAIT reuse and PAWS mechanism code paths
at once. At the same time, it leaves open the path to fully converting the
TS.Recent update timestamp to milliseconds.
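
The unit conversion this implies for the PAWS path can be modelled as
follows. This is a simplified userspace sketch: the enum and the function
name are stand-ins invented for illustration, not the kernel's definitions,
and wrap-around of the 32-bit millisecond counter is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Stand-ins for the kernel's substate constants; values are arbitrary. */
enum tw_substate_model { MODEL_FIN_WAIT2, MODEL_TIME_WAIT };

/* Return the TS.Recent update stamp in seconds, the unit that the PAWS
 * check expects, whichever unit the stamp is stored in.
 */
static uint32_t stamp_to_secs(enum tw_substate_model substate, uint32_t stamp)
{
	assert(substate == MODEL_FIN_WAIT2 || substate == MODEL_TIME_WAIT);

	if (substate == MODEL_TIME_WAIT)
		return stamp / MSEC_PER_SEC;	/* stored in milliseconds */
	return stamp;				/* stored in seconds */
}
```

A stamp of 42500 read in the TIME-WAIT substate and a stamp of 42 read in
the FIN-WAIT-2 substate both yield 42 seconds.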

A low-effort alternative would be to introduce a new field to hold a
millisecond timestamp for measuring the TW reuse delay. However, this would
cause the size of struct tcp_timewait_sock to go over 256 bytes and
overflow into another cache line.

[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/tcp.h      |  9 ++++++++-
 net/ipv4/tcp_ipv4.c      |  6 +++---
 net/ipv4/tcp_minisocks.c | 20 ++++++++++++++------
 3 files changed, 25 insertions(+), 10 deletions(-)

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f88daaa76d836654b2a2e217d0d744d3713d368e..3844ccb2a1fa7eb5e96b466681a0652cadec9354 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -542,7 +542,14 @@  struct tcp_timewait_sock {
 	/* The time we sent the last out-of-window ACK: */
 	u32			  tw_last_oow_ack_time;
 
-	int			  tw_ts_recent_stamp;
+	/**
+	 * @tw_ts_recent_stamp: Timestamp of last TS.Recent update (RFC 7323).
+	 *
+	 * Timestamp resolution depends on @tw_sk.tw_substate state. Has second
+	 * resolution in %TCP_FIN_WAIT2 state and millisecond resolution in
+	 * %TCP_TIME_WAIT state.
+	 */
+	u32			  tw_ts_recent_stamp;
 	u32			  tw_tx_delay;
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key	  *tw_md5_key;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44dbd95fcea08bd81e0ceaa70177ac8a..501e9265b6ebab475ae0a957175286fb153918e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -119,7 +119,7 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	const struct inet_timewait_sock *tw = inet_twsk(sktw);
 	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
 	struct tcp_sock *tp = tcp_sk(sk);
-	int ts_recent_stamp;
+	u32 ts_recent_stamp;
 
 	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2)
 		reuse = 0;
@@ -163,8 +163,8 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	 */
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
 	if (ts_recent_stamp &&
-	    (!twp || (reuse && time_after32(ktime_get_seconds(),
-					    ts_recent_stamp)))) {
+	    (!twp || (reuse && time_after32(tcp_clock_ms(),
+					    ts_recent_stamp + MSEC_PER_SEC)))) {
 		/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
 		 * and releasing the bucket lock.
 		 */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867ac3ed8610ceb9fef7e74cd465b3ea..6d7e3c974d2ae4fd9e147d8fa222e4c20728b896 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -104,8 +104,10 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	bool paws_reject = false;
 	int ts_recent_stamp;
+	u8 tw_substate;
 
 	tmp_opt.saw_tstamp = 0;
+	tw_substate = READ_ONCE(tw->tw_substate);
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
 	if (th->doff > (sizeof(*th) >> 2) && ts_recent_stamp) {
 		tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL);
@@ -114,12 +116,15 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			if (tmp_opt.rcv_tsecr)
 				tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
 			tmp_opt.ts_recent	= READ_ONCE(tcptw->tw_ts_recent);
-			tmp_opt.ts_recent_stamp	= ts_recent_stamp;
+			if (tw_substate == TCP_TIME_WAIT)
+				tmp_opt.ts_recent_stamp = ts_recent_stamp / MSEC_PER_SEC;
+			else
+				tmp_opt.ts_recent_stamp	= ts_recent_stamp;
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
 
-	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2) {
+	if (tw_substate == TCP_FIN_WAIT2) {
 		/* Just repeat all the checks of tcp_rcv_state_process() */
 
 		/* Out of window, send ACK */
@@ -158,7 +163,7 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 
 		if (tmp_opt.saw_tstamp) {
 			WRITE_ONCE(tcptw->tw_ts_recent_stamp,
-				  ktime_get_seconds());
+				  tcp_clock_ms());
 			WRITE_ONCE(tcptw->tw_ts_recent,
 				   tmp_opt.rcv_tsval);
 		}
@@ -207,7 +212,7 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			WRITE_ONCE(tcptw->tw_ts_recent,
 				   tmp_opt.rcv_tsval);
 			WRITE_ONCE(tcptw->tw_ts_recent_stamp,
-				   ktime_get_seconds());
+				   tcp_clock_ms());
 		}
 
 		inet_twsk_put(tw);
@@ -320,8 +325,11 @@  void tcp_time_wait(struct sock *sk, int state, int timeo)
 		tcptw->tw_snd_nxt	= tp->snd_nxt;
 		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);
 		tcptw->tw_ts_recent	= tp->rx_opt.ts_recent;
-		tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
-		tcptw->tw_ts_offset	= tp->tsoffset;
+		if (state == TCP_TIME_WAIT && tp->rx_opt.ts_recent_stamp)
+			tcptw->tw_ts_recent_stamp = tcp_time_stamp_ms(tp);
+		else
+			tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+		tcptw->tw_ts_offset = tp->tsoffset;
 		tw->tw_usec_ts		= tp->tcp_usec_ts;
 		tcptw->tw_last_oow_ack_time = 0;
 		tcptw->tw_tx_delay	= tp->tcp_tx_delay;