From patchwork Wed Nov 13 10:06:42 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jakub Sitnicki
X-Patchwork-Id: 13873398
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
From: Jakub Sitnicki
Date: Wed, 13 Nov 2024 11:06:42 +0100
Subject: [PATCH RFC net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision
Message-Id: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-1-b0a335247304@cloudflare.com>
References: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com>
In-Reply-To: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com>
To: netdev@vger.kernel.org
Cc: Eric Dumazet, Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team@cloudflare.com
X-Mailing-List: netdev@vger.kernel.org
X-Mailer: b4 0.15-dev-355e8

Prepare ground for TIME-WAIT socket reuse with subsecond delay.

Today the TS.Recent last update timestamp, recorded in seconds and stored
in the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two
purposes.

Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).

For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.

Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.

The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).

In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after a shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.

As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually fall in the (0, 1] second range, and be 0.5
seconds on average. We delay TW reuse for one full second only when the
last TS.Recent update coincides with our virtual 1 Hz clock tick.

We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.

Considering the above, change the resolution of the TS.Recent update
timestamp stored in the TW socket (tw_ts_recent_stamp) to milliseconds to
(i) reliably delay TIME-WAIT reuse by one second, and (ii) prepare for a
configurable subsecond reuse delay in a subsequent change.

Limit the resolution change to just the true TIME-WAIT state, that is when
the TW socket is in the TCP_TIME_WAIT substate. This approach offers a
tradeoff between the added complexity of converting between time units and
the risk of touching both TIME-WAIT reuse and PAWS mechanism code paths at
once. At the same time, it leaves the path to fully converting the
TS.Recent update timestamp to milliseconds open.

A low effort alternative would be to introduce a new field to hold a
millisecond timestamp for measuring the TW reuse delay. However, this would
cause the struct tcp_timewait_sock size to go over 256 bytes and overflow
into another cache line.

[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af

Signed-off-by: Jakub Sitnicki
---
 include/linux/tcp.h      |  9 ++++++++-
 net/ipv4/tcp_ipv4.c      |  6 +++---
 net/ipv4/tcp_minisocks.c | 20 ++++++++++++++------
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f88daaa76d836654b2a2e217d0d744d3713d368e..3844ccb2a1fa7eb5e96b466681a0652cadec9354 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -542,7 +542,14 @@ struct tcp_timewait_sock {
 	/* The time we sent the last out-of-window ACK: */
 	u32			  tw_last_oow_ack_time;
 
-	int			  tw_ts_recent_stamp;
+	/**
+	 * @tw_ts_recent_stamp: Timestamp of last TS.Recent update (RFC 7323).
+	 *
+	 * Timestamp resolution depends on @tw_sk.tw_substate state. Has second
+	 * resolution in %TCP_FIN_WAIT2 state and millisecond resolution in
+	 * %TCP_TIME_WAIT state.
+	 */
+	u32			  tw_ts_recent_stamp;
 	u32			  tw_tx_delay;
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key	  *tw_md5_key;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44dbd95fcea08bd81e0ceaa70177ac8a..501e9265b6ebab475ae0a957175286fb153918e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -119,7 +119,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	const struct inet_timewait_sock *tw = inet_twsk(sktw);
 	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
 	struct tcp_sock *tp = tcp_sk(sk);
-	int ts_recent_stamp;
+	u32 ts_recent_stamp;
 
 	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2)
 		reuse = 0;
@@ -163,8 +163,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	 */
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
 	if (ts_recent_stamp &&
-	    (!twp || (reuse && time_after32(ktime_get_seconds(),
-					    ts_recent_stamp)))) {
+	    (!twp || (reuse && time_after32(tcp_clock_ms(),
+					    ts_recent_stamp + MSEC_PER_SEC)))) {
 		/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
 		 * and releasing the bucket lock.
 		 */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867ac3ed8610ceb9fef7e74cd465b3ea..6d7e3c974d2ae4fd9e147d8fa222e4c20728b896 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -104,8 +104,10 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	bool paws_reject = false;
 	int ts_recent_stamp;
+	u8 tw_substate;
 
 	tmp_opt.saw_tstamp = 0;
+	tw_substate = READ_ONCE(tw->tw_substate);
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
 	if (th->doff > (sizeof(*th) >> 2) && ts_recent_stamp) {
 		tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL);
@@ -114,12 +116,15 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			if (tmp_opt.rcv_tsecr)
 				tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
 			tmp_opt.ts_recent = READ_ONCE(tcptw->tw_ts_recent);
-			tmp_opt.ts_recent_stamp = ts_recent_stamp;
+			if (tw_substate == TCP_TIME_WAIT)
+				tmp_opt.ts_recent_stamp = ts_recent_stamp / MSEC_PER_SEC;
+			else
+				tmp_opt.ts_recent_stamp = ts_recent_stamp;
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
 
-	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2) {
+	if (tw_substate == TCP_FIN_WAIT2) {
 		/* Just repeat all the checks of tcp_rcv_state_process() */
 
 		/* Out of window, send ACK */
@@ -158,7 +163,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 
 		if (tmp_opt.saw_tstamp) {
 			WRITE_ONCE(tcptw->tw_ts_recent_stamp,
-				  ktime_get_seconds());
+				  tcp_clock_ms());
 			WRITE_ONCE(tcptw->tw_ts_recent,
 				   tmp_opt.rcv_tsval);
 		}
@@ -207,7 +212,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			WRITE_ONCE(tcptw->tw_ts_recent,
 				   tmp_opt.rcv_tsval);
 			WRITE_ONCE(tcptw->tw_ts_recent_stamp,
-				   ktime_get_seconds());
+				   tcp_clock_ms());
 		}
 
 		inet_twsk_put(tw);
@@ -320,8 +325,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		tcptw->tw_snd_nxt	= tp->snd_nxt;
 		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);
 		tcptw->tw_ts_recent	= tp->rx_opt.ts_recent;
-		tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
-		tcptw->tw_ts_offset	= tp->tsoffset;
+		if (state == TCP_TIME_WAIT && tp->rx_opt.ts_recent_stamp)
+			tcptw->tw_ts_recent_stamp = tcp_time_stamp_ms(tp);
+		else
+			tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+		tcptw->tw_ts_offset = tp->tsoffset;
 		tw->tw_usec_ts		= tp->tcp_usec_ts;
 		tcptw->tw_last_oow_ack_time = 0;
 		tcptw->tw_tx_delay	= tp->tcp_tx_delay;

From patchwork Wed Nov 13 10:06:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jakub Sitnicki
X-Patchwork-Id: 13873399
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
From: Jakub Sitnicki
Date: Wed, 13 Nov 2024 11:06:43 +0100
Subject: [PATCH RFC net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay
Message-Id: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-2-b0a335247304@cloudflare.com>
References: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com>
In-Reply-To: <20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com>
To: netdev@vger.kernel.org
Cc: Eric Dumazet, Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team@cloudflare.com
X-Mailing-List: netdev@vger.kernel.org
X-Mailer: b4 0.15-dev-355e8

Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).

However, this means that in the presence of short lived connections with an
RTT of a couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer than the connection lifetime.

Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.

Make the reuse delay configurable so that sysadmins can make more
aggressive assumptions about the remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.

Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.

This new sysctl offers a trade-off: the sysadmin can balance the risk of
PAWS detection failures against exhausting ports by having sockets tied up
in TIME-WAIT state for too long.

[1] https://lpc.events/event/16/contributions/1349/

Signed-off-by: Jakub Sitnicki
---
 Documentation/networking/ip-sysctl.rst                        | 14 ++++++++++++++
 .../networking/net_cachelines/netns_ipv4_sysctl.rst           |  1 +
 include/net/netns/ipv4.h                                      |  1 +
 include/net/tcp.h                                             |  1 +
 net/ipv4/sysctl_net_ipv4.c                                    | 10 ++++++++++
 net/ipv4/tcp_ipv4.c                                           |  5 ++++-
 6 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index eacf8983e2307476895a8def7363375f2af36d9d..2f2b00295836be80e1da11370022ca083d7d1eb2 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1000,6 +1000,20 @@ tcp_tw_reuse - INTEGER
 
 	Default: 2
 
+tcp_tw_reuse_delay - UNSIGNED INTEGER
+	The delay in milliseconds before a TIME-WAIT socket can be reused by a
+	new connection, if TIME-WAIT socket reuse is enabled. The actual reuse
+	threshold is within the [N, N+1] range, where N is the requested delay
+	in milliseconds, to ensure the delay interval is never shorter than the
+	configured value.
+
+	This setting contains an assumption about the other TCP timestamp clock
+	tick interval. It should not be set to a value lower than the peer's
+	clock tick for the PAWS (Protection Against Wrapped Sequence numbers)
+	mechanism to work correctly for the reused connection.
+
+	Default: 1000 (milliseconds)
+
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index 629da6dc6d746ce8058cfbe2215d33d55ca4c19d..de0263302f16dd815593671c4f75a93ed6f7cac4 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -79,6 +79,7 @@ u8                              sysctl_tcp_retries1
 u8                              sysctl_tcp_retries2
 u8                              sysctl_tcp_orphan_retries
 u8                              sysctl_tcp_tw_reuse                                               timewait_sock_ops
+unsigned_int                    sysctl_tcp_tw_reuse_delay                                         timewait_sock_ops
 int                             sysctl_tcp_fin_timeout                                            TCP_LAST_ACK/tcp_rcv_state_process
 unsigned_int                    sysctl_tcp_notsent_lowat                     read_mostly          tcp_notsent_lowat/tcp_stream_memory_free
 u8                              sysctl_tcp_sack                                                   tcp_syn_options
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3c014170e0012818db36d4a7a327025e3fa00dd1..46452da352061007d19d00fdacddd25bbe56444d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -175,6 +175,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_retries2;
 	u8 sysctl_tcp_orphan_retries;
 	u8 sysctl_tcp_tw_reuse;
+	unsigned int sysctl_tcp_tw_reuse_delay;
 	int sysctl_tcp_fin_timeout;
 	u8 sysctl_tcp_sack;
 	u8 sysctl_tcp_window_scaling;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9b37b76e894bb8d6715bbea4006d9157085cf5c..e1119f0a476b5cc2396af669bb28f91ac654d7b2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -185,6 +185,7 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
 					 * timestamps. It must be less than
 					 * minimal timewait lifetime.
 					 */
+
 /*
  *	TCP option
  */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a79b2a52ce01e6c1a1257ba31c17ac2f51ba19ec..42cb5dc9cb245c26f9a38f8c8c4b26b1adddca39 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -45,6 +45,7 @@ static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
 static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
 static int tcp_plb_max_rounds = 31;
 static int tcp_plb_max_cong_thresh = 256;
+static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1065,6 +1066,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_TWO,
 	},
+	{
+		.procname	= "tcp_tw_reuse_delay",
+		.data		= &init_net.ipv4.sysctl_tcp_tw_reuse_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= &tcp_tw_reuse_delay_max,
+	},
 	{
 		.procname	= "tcp_max_syn_backlog",
 		.data		= &init_net.ipv4.sysctl_max_syn_backlog,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 501e9265b6ebab475ae0a957175286fb153918e6..40ef91cb3eb511cc19bb96c6fd648cf8f8175655 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -120,6 +120,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 ts_recent_stamp;
+	u32 reuse_delay;
 
 	if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2)
 		reuse = 0;
@@ -162,9 +163,10 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	   and use initial timestamp retrieved from peer table.
 	 */
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
+	reuse_delay = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tw_reuse_delay);
 	if (ts_recent_stamp &&
 	    (!twp || (reuse && time_after32(tcp_clock_ms(),
-					    ts_recent_stamp + MSEC_PER_SEC)))) {
+					    ts_recent_stamp + reuse_delay)))) {
 		/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
 		 * and releasing the bucket lock.
 		 */
@@ -3457,6 +3459,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
 	net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
 	net->ipv4.sysctl_tcp_tw_reuse = 2;
+	net->ipv4.sysctl_tcp_tw_reuse_delay = 1 * MSEC_PER_SEC;
 	net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;
 
 	refcount_set(&net->ipv4.tcp_death_row.tw_refcount, 1);