diff mbox series

tcp/dccp: Add another way to allocate local ports in connect()

Message ID 20240729095554.28296-1-xiaolinkui@126.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series tcp/dccp: Add another way to allocate local ports in connect() | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 55 this patch: 56
netdev/build_tools success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang fail Errors and warnings before: 61 this patch: 63
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 965 this patch: 966
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 60 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0

Commit Message

xiaolinkui July 29, 2024, 9:55 a.m. UTC
From: Linkui Xiao <xiaolinkui@kylinos.cn>

Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()") allocates even ports for connect() first while leaving
odd ports for bind() and this works well in busy servers.

But this strategy causes severe performance degradation in busy clients.
when a client has used more than half of the local ports setted in
proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
to a server again, the connect time increases rapidly since it will
traverse all the even ports though they are exhausted.

So this path provides another strategy by introducing a system option:
local_port_allocation. If it is a busy client, users should set it to 1
to use sequential allocation while it should be set to 0 in other
situations. Its default value is 0.

In commit 207184853dbd ("tcp/dccp: change source port selection at
connect() time"), tell users that they can access all odd and even ports
by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
socket application. When even numbered ports are not sufficient, use the
sysctl parameter to achieve the same effect:
	sysctl -w net.ipv4.local_port_allocation=1

Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
---
 include/net/tcp.h          |  1 +
 net/ipv4/inet_hashtables.c | 12 ++++++++----
 net/ipv4/sysctl_net_ipv4.c |  8 ++++++++
 3 files changed, 17 insertions(+), 4 deletions(-)

Comments

Eric Dumazet July 29, 2024, 12:46 p.m. UTC | #1
On Mon, Jul 29, 2024 at 11:56 AM <xiaolinkui@126.com> wrote:
>
> From: Linkui Xiao <xiaolinkui@kylinos.cn>
>
> Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
> in connect()") allocates even ports for connect() first while leaving
> odd ports for bind() and this works well in busy servers.
>
> But this strategy causes severe performance degradation in busy clients.
> when a client has used more than half of the local ports setted in
> proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
> to a server again, the connect time increases rapidly since it will
> traverse all the even ports though they are exhausted.
>
> So this path provides another strategy by introducing a system option:
> local_port_allocation. If it is a busy client, users should set it to 1
> to use sequential allocation while it should be set to 0 in other
> situations. Its default value is 0.
>
> In commit 207184853dbd ("tcp/dccp: change source port selection at
> connect() time"), tell users that they can access all odd and even ports
> by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
> socket application. When even numbered ports are not sufficient, use the
> sysctl parameter to achieve the same effect:
>         sysctl -w net.ipv4.local_port_allocation=1
>
> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>

Too many errors in this patch...

Lack of READ_ONCE() when reading a sysctl.
Lack or per-netns sysctl.
No documentation.
Jason Xing July 29, 2024, 2:17 p.m. UTC | #2
Hi Linkui,

On Mon, Jul 29, 2024 at 5:58 PM <xiaolinkui@126.com> wrote:
>
> From: Linkui Xiao <xiaolinkui@kylinos.cn>
>
> Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
> in connect()") allocates even ports for connect() first while leaving
> odd ports for bind() and this works well in busy servers.
>
> But this strategy causes severe performance degradation in busy clients.
> when a client has used more than half of the local ports setted in
> proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
> to a server again, the connect time increases rapidly since it will
> traverse all the even ports though they are exhausted.
>
> So this path provides another strategy by introducing a system option:
> local_port_allocation. If it is a busy client, users should set it to 1
> to use sequential allocation while it should be set to 0 in other
> situations. Its default value is 0.
>
> In commit 207184853dbd ("tcp/dccp: change source port selection at
> connect() time"), tell users that they can access all odd and even ports
> by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
> socket application. When even numbered ports are not sufficient, use the
> sysctl parameter to achieve the same effect:
>         sysctl -w net.ipv4.local_port_allocation=1
>
> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
> ---
>  include/net/tcp.h          |  1 +
>  net/ipv4/inet_hashtables.c | 12 ++++++++----
>  net/ipv4/sysctl_net_ipv4.c |  8 ++++++++
>  3 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 2aac11e7e1cc..99969b8e5183 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -269,6 +269,7 @@ DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
>
>  extern struct percpu_counter tcp_sockets_allocated;
>  extern unsigned long tcp_memory_pressure;
> +extern bool sysctl_local_port_allocation;
>
>  /* optimized version of sk_under_memory_pressure() for TCP sockets */
>  static inline bool tcp_under_memory_pressure(const struct sock *sk)
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 48d0d494185b..e572f8b21b95 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1020,11 +1020,15 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>         l3mdev = inet_sk_bound_l3mdev(sk);
>
[...]
>         local_ports = inet_sk_get_local_port_range(sk, &low, &high);
> -       step = local_ports ? 1 : 2;
> +       /* local_port_allocation 0 means even and odd port allocation strategy
> +        * will be applied, so step is 2; otherwise sequential allocation will
> +        * be used and step is 1. Default value is 0.
> +        */
> +       step = sysctl_local_port_allocation ? 1 : 2;

Apart from what Eric said, it seems you break the setsockopt
method...? Please look at the 'local_ports'.

Once again, I'm confused about whether we should introduce such a
sysctl knob to do the same thing as an existing method does. Maybe it
suits a private kernel. I'm just saying.

Thanks,
Jason

>
>         high++; /* [32768, 60999] -> [32768, 61000[ */
>         remaining = high - low;
> -       if (!local_ports && remaining > 1)
> +       if (!sysctl_local_port_allocation && remaining > 1)
>                 remaining &= ~1U;
>
>         get_random_sleepable_once(table_perturb,
> @@ -1037,7 +1041,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>         /* In first pass we try ports of @low parity.
>          * inet_csk_get_port() does the opposite choice.
>          */
> -       if (!local_ports)
> +       if (!sysctl_local_port_allocation)
>                 offset &= ~1U;
>  other_parity_scan:
>         port = low + offset;
> @@ -1081,7 +1085,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>                 cond_resched();
>         }
>
> -       if (!local_ports) {
> +       if (!sysctl_local_port_allocation) {
>                 offset++;
>                 if ((offset & 1) && remaining > 1)
>                         goto other_parity_scan;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 9140d20eb2d4..1f6bf3a73516 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -45,6 +45,7 @@ static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
>  static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
>  static int tcp_plb_max_rounds = 31;
>  static int tcp_plb_max_cong_thresh = 256;
> +bool sysctl_local_port_allocation;
>
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -632,6 +633,13 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &sysctl_fib_sync_mem_min,
>                 .extra2         = &sysctl_fib_sync_mem_max,
>         },
> +       {
> +               .procname       = "local_port_allocation",
> +               .data           = &sysctl_local_port_allocation,
> +               .maxlen         = sizeof(sysctl_local_port_allocation),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dobool,
> +       },
>  };
>
>  static struct ctl_table ipv4_net_table[] = {
> --
> 2.17.1
>
>
Kuniyuki Iwashima July 29, 2024, 3:57 p.m. UTC | #3
From: xiaolinkui@126.com
Date: Mon, 29 Jul 2024 17:55:54 +0800
> From: Linkui Xiao <xiaolinkui@kylinos.cn>
> 
> Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
> in connect()") allocates even ports for connect() first while leaving
> odd ports for bind() and this works well in busy servers.
> 
> But this strategy causes severe performance degradation in busy clients.
> when a client has used more than half of the local ports setted in
> proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
> to a server again, the connect time increases rapidly since it will
> traverse all the even ports though they are exhausted.
> 
> So this path provides another strategy by introducing a system option:
> local_port_allocation. If it is a busy client, users should set it to 1
> to use sequential allocation while it should be set to 0 in other
> situations. Its default value is 0.
> 
> In commit 207184853dbd ("tcp/dccp: change source port selection at
> connect() time"), tell users that they can access all odd and even ports
> by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
> socket application.

The application should be changed, or probably you can put your application
into a cgroup and hook connect() to call bpf_setsockopt() and silently
enable IP_LOCAL_PORT_RANGE.
Eric Dumazet July 29, 2024, 4:16 p.m. UTC | #4
On Mon, Jul 29, 2024 at 5:57 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: xiaolinkui@126.com
> Date: Mon, 29 Jul 2024 17:55:54 +0800
> > From: Linkui Xiao <xiaolinkui@kylinos.cn>
> >
> > Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
> > in connect()") allocates even ports for connect() first while leaving
> > odd ports for bind() and this works well in busy servers.
> >
> > But this strategy causes severe performance degradation in busy clients.
> > when a client has used more than half of the local ports setted in
> > proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
> > to a server again, the connect time increases rapidly since it will
> > traverse all the even ports though they are exhausted.
> >
> > So this path provides another strategy by introducing a system option:
> > local_port_allocation. If it is a busy client, users should set it to 1
> > to use sequential allocation while it should be set to 0 in other
> > situations. Its default value is 0.
> >
> > In commit 207184853dbd ("tcp/dccp: change source port selection at
> > connect() time"), tell users that they can access all odd and even ports
> > by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
> > socket application.
>
> The application should be changed, or probably you can put your application
> into a cgroup and hook connect() to call bpf_setsockopt() and silently
> enable IP_LOCAL_PORT_RANGE.

LD_PRELOAD can also be used for non eBPF users.
Eric Dumazet July 30, 2024, 7:09 a.m. UTC | #5
On Tue, Jul 30, 2024 at 8:38 AM xiaolinkui <xiaolinkui@126.com> wrote:
>
> Thanks for your reply.
>
> At 2024-07-29 19:46:38, "Eric Dumazet" <edumazet@google.com> wrote:
> >On Mon, Jul 29, 2024 at 11:56 AM <xiaolinkui@126.com> wrote:
> >>
> >> From: Linkui Xiao <xiaolinkui@kylinos.cn>
> >>
> >> Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
> >> in connect()") allocates even ports for connect() first while leaving
> >> odd ports for bind() and this works well in busy servers.
> >>
> >> But this strategy causes severe performance degradation in busy clients.
> >> when a client has used more than half of the local ports setted in
> >> proc/sys/net/ipv4/ip_local_port_range, if this client try to connect
> >> to a server again, the connect time increases rapidly since it will
> >> traverse all the even ports though they are exhausted.
> >>
> >> So this path provides another strategy by introducing a system option:
> >> local_port_allocation. If it is a busy client, users should set it to 1
> >> to use sequential allocation while it should be set to 0 in other
> >> situations. Its default value is 0.
> >>
> >> In commit 207184853dbd ("tcp/dccp: change source port selection at
> >> connect() time"), tell users that they can access all odd and even ports
> >> by using IP_LOCAL_PORT_RANGE. But this requires users to modify the
> >> socket application. When even numbered ports are not sufficient, use the
> >> sysctl parameter to achieve the same effect:
> >>         sysctl -w net.ipv4.local_port_allocation=1
> >>
> >> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
> >
> >Too many errors in this patch...
> >
> >Lack of READ_ONCE() when reading a sysctl.
> >Lack or per-netns sysctl.
> >No documentation.
> Yes, it was my negligence.But do you think it's necessary
> for me to send a v2 version?Or should we maintain the previous
> state here and put the sysctl setting method in my private kernel?

I think that adding a sysctl because some applications "can not be
changed" is not convincing.

Another useful socket option is IP_BIND_ADDRESS_NO_PORT, which also is
a choice made by applications.
diff mbox series

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2aac11e7e1cc..99969b8e5183 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -269,6 +269,7 @@  DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
 
 extern struct percpu_counter tcp_sockets_allocated;
 extern unsigned long tcp_memory_pressure;
+extern bool sysctl_local_port_allocation;
 
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 48d0d494185b..e572f8b21b95 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1020,11 +1020,15 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
-	step = local_ports ? 1 : 2;
+	/* local_port_allocation 0 means even and odd port allocation strategy
+	 * will be applied, so step is 2; otherwise sequential allocation will
+	 * be used and step is 1. Default value is 0.
+	 */
+	step = sysctl_local_port_allocation ? 1 : 2;
 
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
-	if (!local_ports && remaining > 1)
+	if (!sysctl_local_port_allocation && remaining > 1)
 		remaining &= ~1U;
 
 	get_random_sleepable_once(table_perturb,
@@ -1037,7 +1041,7 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	/* In first pass we try ports of @low parity.
 	 * inet_csk_get_port() does the opposite choice.
 	 */
-	if (!local_ports)
+	if (!sysctl_local_port_allocation)
 		offset &= ~1U;
 other_parity_scan:
 	port = low + offset;
@@ -1081,7 +1085,7 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		cond_resched();
 	}
 
-	if (!local_ports) {
+	if (!sysctl_local_port_allocation) {
 		offset++;
 		if ((offset & 1) && remaining > 1)
 			goto other_parity_scan;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9140d20eb2d4..1f6bf3a73516 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -45,6 +45,7 @@  static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
 static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
 static int tcp_plb_max_rounds = 31;
 static int tcp_plb_max_cong_thresh = 256;
+bool sysctl_local_port_allocation;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -632,6 +633,13 @@  static struct ctl_table ipv4_table[] = {
 		.extra1		= &sysctl_fib_sync_mem_min,
 		.extra2		= &sysctl_fib_sync_mem_max,
 	},
+	{
+		.procname	= "local_port_allocation",
+		.data		= &sysctl_local_port_allocation,
+		.maxlen		= sizeof(sysctl_local_port_allocation),
+		.mode		= 0644,
+		.proc_handler	= proc_dobool,
+	},
 };
 
 static struct ctl_table ipv4_net_table[] = {