[RFC,net-next,v2,0/2] Make TIME-WAIT reuse delay deterministic and configurable

Message ID: 20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com

Jakub Sitnicki Nov. 13, 2024, 10:06 a.m. UTC
This is an iteration on an effort to make the TIME-WAIT reuse delay
shorter, which we recently presented at Plumbers [1].

I have addressed the RFCv1 feedback and reworked the changes so that the
scope is limited to just the TIME-WAIT reuse code and does not touch the
PAWS implementation. Please see patch 1 description for the discussion of
this approach.

The end result is that with this patch set the TS.Recent last update
timestamp is interpreted as a millisecond value only in the true TIME-WAIT
state.

I feel that switching the TS.Recent last update timestamp to milliseconds
everywhere in one go would needlessly bundle up the risk of causing
regressions in both TIME-WAIT reuse and PAWS detection code.

Unless there is strong guidance from the maintainers to do it all at
once, I'd sleep better if we could do it in steps and attack just the
TIME-WAIT reuse first.

This patch set is accompanied by a set of packetdrill tests for both
slow (after 1 sec) and fast (after 1 msec) TIME-WAIT reuse covering happy
and failure scenarios. They can be reviewed at [2]. If the proposed changes
make it into the kernel, I will submit a PR to the packetdrill repo.

I also plan on adding coverage for PAWS old duplicate detection, as I don't
think we have any in the packetdrill repo. Those tests would be part of the
follow-up changes where we would have to touch the PAWS code to use
milliseconds for the TS.Recent last update timestamp everywhere.

We will be rolling these changes out internally to a limited set of
production machines to catch any potential regressions before posting
another iteration. We will report on any findings.

Credit is due to Adrien Vasseur and Lee Valentine for the initial report
on how the use of IP_LOCAL_PORT_RANGE, when the TIME-WAIT reuse delay is up
to 1 second long, can lead to port exhaustion under connection pressure.
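
To put rough numbers on the port exhaustion problem, here is a
back-of-the-envelope sketch. It is illustrative only and not part of the
series. The helper names are hypothetical. The estimate assumes short-lived
connections from one local IP to a single remote address and port, with
every port going through TIME-WAIT on our side, so each port can be used at
most once per reuse delay.

```c
#include <stdint.h>

/* Pack a local port range the way the IP_LOCAL_PORT_RANGE socket option
 * takes it: lower bound in the low 16 bits, upper bound in the high 16.
 */
static uint32_t pack_port_range(uint16_t lo, uint16_t hi)
{
	return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Hypothetical estimate, not kernel code: upper bound on new connections
 * per second to one destination when each local port in [lo, hi] becomes
 * reusable only after reuse_delay_ms. Assumes hi >= lo and a non-zero
 * delay, and ignores the lifetime of the connection itself.
 */
static uint64_t max_conn_rate_per_sec(uint16_t lo, uint16_t hi,
				      uint32_t reuse_delay_ms)
{
	uint64_t nports = (uint64_t)hi - lo + 1;

	return nports * 1000 / reuse_delay_ms;
}
```

Under these assumptions, a 1000-port range caps out at about 1000
connections per second with a 1 sec delay, and at about 1,000,000 with a
1 msec delay.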

It goes without saying that we are looking for your feedback while we test
this.

Thanks,
-jkbs

[1] https://lpc.events/event/18/contributions/1962/ 
[2] https://github.com/jsitnicki/packetdrill/tree/tw-reuse-tests/rfc2/gtests/net/tcp/ts_recent/tw_reuse

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Changes in RFCv2:
- Make TIME-WAIT reuse configurable through a per-netns sysctl.
- Account for timestamp rounding so delay is not shorter than set value.
- Use tcp_mstamp when we know it is fresh due to receiving a segment.
- Link to RFCv1: https://lore.kernel.org/r/20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com
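
To illustrate the rounding item above, a minimal userspace sketch follows.
It is not the code from the patches, and the function names are
hypothetical. It assumes both timestamps are microsecond readings truncated
to whole milliseconds. Their difference can then overstate the real elapsed
time by just under 1 msec, so the check requires strictly more than the
configured delay.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: truncate a microsecond clock reading to a 32-bit
 * millisecond stamp.
 */
static uint32_t usec_to_msec_stamp(uint64_t usec)
{
	return (uint32_t)(usec / 1000);
}

/* Hypothetical check: has at least delay_ms really elapsed since tw_ms?
 * With truncated stamps, a difference of N msec only guarantees that more
 * than N - 1 msec have passed, hence the strict comparison. The unsigned
 * subtraction keeps the result correct across 32-bit wraparound.
 */
static bool tw_reuse_delay_elapsed(uint32_t now_ms, uint32_t tw_ms,
				   uint32_t delay_ms)
{
	return (uint32_t)(now_ms - tw_ms) > delay_ms;
}
```

For example, with stamps taken at 1999 usec and 2000 usec the truncated
difference is 1 msec although only 1 usec has passed, so a 1 msec delay is
not yet considered elapsed.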

---
Jakub Sitnicki (2):
      tcp: Measure TIME-WAIT reuse delay with millisecond precision
      tcp: Add sysctl to configure TIME-WAIT reuse delay

 Documentation/networking/ip-sysctl.rst               | 14 ++++++++++++++
 .../networking/net_cachelines/netns_ipv4_sysctl.rst  |  1 +
 include/linux/tcp.h                                  |  9 ++++++++-
 include/net/netns/ipv4.h                             |  1 +
 include/net/tcp.h                                    |  1 +
 net/ipv4/sysctl_net_ipv4.c                           | 10 ++++++++++
 net/ipv4/tcp_ipv4.c                                  |  9 ++++++---
 net/ipv4/tcp_minisocks.c                             | 20 ++++++++++++++------
 8 files changed, 55 insertions(+), 10 deletions(-)