diff mbox series

[net-next] tcp: change source port selection at bind() time

Message ID 20240816153204.93787-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next] tcp: change source port selection at bind() time | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 19 this patch: 19
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 59 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-16--21-00 (tests: 710)

Commit Message

Jason Xing Aug. 16, 2024, 3:32 p.m. UTC
From: Jason Xing <kernelxing@tencent.com>

This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
change source port selection at connect() time").

This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
don't need to iterate every two ports which means only favouring odd
number like the old days before 2016, which can be good for some
users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
connect().

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/ipv4/inet_connection_sock.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

Comments

Eric Dumazet Aug. 19, 2024, 3:45 p.m. UTC | #1
On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
> change source port selection at connect() time").
>
> This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
> don't need to iterate every two ports which means only favouring odd
> number like the old days before 2016, which can be good for some
> users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
> connect().

Except that bind() with a port reservation is not as common as a connect().
This is highly discouraged.

See IP_BIND_ADDRESS_NO_PORT

Can you provide a real use case ?

I really feel like you are trying to push patches 'just because you can'...

'The old days' before 2016 were not very nice, we had P0 all the time
because of port exhaustion.
Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories.
Jason Xing Aug. 19, 2024, 4:11 p.m. UTC | #2
On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
> > change source port selection at connect() time").
> >
> > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
> > don't need to iterate every two ports which means only favouring odd
> > number like the old days before 2016, which can be good for some
> > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
> > connect().
>
> Except that bind() with a port reservation is not as common as a connect().
> This is highly discouraged.
>
> See IP_BIND_ADDRESS_NO_PORT
>
> Can you provide a real use case ?
>
> I really feel like you are trying to push patches 'just because you can'...

You apparently got me wrong and hurt my feelings :/

Since you asked me about this one, I'm going to tell you the whole story.

A few years ago, one of my colleagues reached out to you and
complained about the issue of the new port selection algorithm
(odd/even port selection) to you. But a few years later, the community
finally implemented/extended such IP_LOCAL_PORT_RANGE option to
support finding a suitable port one by one. I'm not sure if you
remembered.

Nearly every month I get issue reports about why the latest kernel
applies a new algorithm because of the high cpu-load increasing so
much suddenly, then I will let them use the older one which I use a
sysctl knob to control internally.

If you pay more attention in these years, you've received more than
one patch trying to use the older algorithm provided for users, even
though they are not accepted. You may ignore the fact. For me, I just
want to provide a way/patch to use the older algorithm for connect()
and bind(), which can be accepted by the community relatively easily.
That's all. I don't believe we are the only special one! So I want to
change something!

Like that 'self-connection' issue, we are also not the only one, so I
exert/push myself to solve the issue thoroughly.

To be honest, whether you accept the patch, I cannot control it, but I
will try my best to do some useful reports and provide real use cases
which are derived from our production.

That's not bad, right?
Jason Xing Aug. 20, 2024, 12:53 a.m. UTC | #3
Hello Eric,

On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
> > change source port selection at connect() time").
> >
> > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
> > don't need to iterate every two ports which means only favouring odd
> > number like the old days before 2016, which can be good for some
> > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
> > connect().
>
> Except that bind() with a port reservation is not as common as a connect().
> This is highly discouraged.
>
> See IP_BIND_ADDRESS_NO_PORT
>
> Can you provide a real use case ?
>
> I really feel like you are trying to push patches 'just because you can'...
>
> 'The old days' before 2016 were not very nice, we had P0 all the time
> because of port exhaustion.
> Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories.

As you mentioned last night, the issues happening in connect() are
relatively more than in bind().

To be more concise, I would like to state 3 points to see if they are valid:
(1) Extending the option for bind() is the last puzzle of using an
older algorithm for some users. Since we have one in connect(), how
about adding it in bind() to provide for the people favouring the
older algorithm.
(2) This patch will not hurt any users like in Google as an example
which prefers odd/even port selection, which is, I admit, indeed more
advanced.
(3) This patch does not come out of thin air, but from some users who I contact.
?

In my opinion, using and adjusting to the new algorithm needs some
changes in applications. For some old applications, they still need
more time to keep pace with a more workable solution.

After considering it a whole night, I would like to push this tiny
feature into the upstream kernel, I wonder if you can help me review
it? Thanks in advance, Eric.

Thanks,
Jason
Kuniyuki Iwashima Aug. 20, 2024, 4:53 a.m. UTC | #4
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Tue, 20 Aug 2024 08:53:53 +0800
> Hello Eric,
> 
> On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
> > > change source port selection at connect() time").
> > >
> > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
> > > don't need to iterate every two ports which means only favouring odd
> > > number like the old days before 2016, which can be good for some
> > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
> > > connect().
> >
> > Except that bind() with a port reservation is not as common as a connect().
> > This is highly discouraged.
> >
> > See IP_BIND_ADDRESS_NO_PORT
> >
> > Can you provide a real use case ?
> >
> > I really feel like you are trying to push patches 'just because you can'...
> >
> > 'The old days' before 2016 were not very nice, we had P0 all the time
> > because of port exhaustion.
> > Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories.
> 
> As you mentioned last night, the issues happening in connect() are
> relatively more than in bind().
> 
> To be more concise, I would like to state 3 points to see if they are valid:
> (1) Extending the option for bind() is the last puzzle of using an
> older algorithm for some users. Since we have one in connect(), how
> about adding it in bind() to provide for the people favouring the
> older algorithm.

Why do they want to use bind() to pick a random port in the first place ?

bind() behaviour is not strictly the same with connect(); the port reserved
by bind() is not reusable for connect().

Also, bind() requires SO_REUSEADDR to share a port, but by default, even
SO_REUSEADDR enabled sockets cannot share the same port if application
uses random-pick by bind((IP, 0)):

  # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
  # python3
  >>> from socket import *
  >>> 
  >>> c1 = socket()
  >>> c1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
  >>> c1.bind(('', 0))
  >>> c1
  <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 32768)>
  >>> 
  >>> c2 = socket()
  >>> c2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
  >>> c2.bind(('', 0))
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  OSError: [Errno 98] Address already in use

Then, net.ipv4.ip_autobind_reuse needs to be enabled at some risk.

bind()+connect() simply decreases the number of available 4-tuple on
the netns unless all applications use bind()+connect() instead of just
connect(), and it's unlikely.


> (2) This patch will not hurt any users like in Google as an example
> which prefers odd/even port selection, which is, I admit, indeed more
> advanced.

Indeed, it won't hurt existing users but will lead new users to the
wrong way.


> (3) This patch does not come out of thin air, but from some users who I contact.
> ?

Is someone who contacted to you really aware of all of the above and
even then in favor of bind() without IP_BIND_ADDRESS_NO_PORT ?


> In my opinion, using and adjusting to the new algorithm needs some
> changes in applications. For some old applications, they still need
> more time to keep pace with a more workable solution.

They will add setsockopt(IP_LOCAL_PORT_RANGE) whether your patch is
applied or not, then, only thing they need to do is replace SO_REUSEADDR
with IP_BIND_ADDRESS_NO_PORT, simple enough ?


> After considering it a whole night, I would like to push this tiny
> feature into the upstream kernel, I wonder if you can help me review
> it? Thanks in advance, Eric.
Jason Xing Aug. 20, 2024, 5:34 a.m. UTC | #5
On Tue, Aug 20, 2024 at 12:53 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Tue, 20 Aug 2024 08:53:53 +0800
> > Hello Eric,
> >
> > On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp:
> > > > change source port selection at connect() time").
> > > >
> > > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we
> > > > don't need to iterate every two ports which means only favouring odd
> > > > number like the old days before 2016, which can be good for some
> > > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in
> > > > connect().
> > >
> > > Except that bind() with a port reservation is not as common as a connect().
> > > This is highly discouraged.
> > >
> > > See IP_BIND_ADDRESS_NO_PORT
> > >
> > > Can you provide a real use case ?
> > >
> > > I really feel like you are trying to push patches 'just because you can'...
> > >
> > > 'The old days' before 2016 were not very nice, we had P0 all the time
> > > because of port exhaustion.
> > > Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories.
> >
> > As you mentioned last night, the issues happening in connect() are
> > relatively more than in bind().
> >
> > To be more concise, I would like to state 3 points to see if they are valid:
> > (1) Extending the option for bind() is the last puzzle of using an
> > older algorithm for some users. Since we have one in connect(), how
> > about adding it in bind() to provide for the people favouring the
> > older algorithm.
>
> Why do they want to use bind() to pick a random port in the first place ?
>
> bind() behaviour is not strictly the same with connect(); the port reserved
> by bind() is not reusable for connect().
>
> Also, bind() requires SO_REUSEADDR to share a port, but by default, even
> SO_REUSEADDR enabled sockets cannot share the same port if application
> uses random-pick by bind((IP, 0)):
>
>   # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
>   # python3
>   >>> from socket import *
>   >>>
>   >>> c1 = socket()
>   >>> c1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   >>> c1.bind(('', 0))
>   >>> c1
>   <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 32768)>
>   >>>
>   >>> c2 = socket()
>   >>> c2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   >>> c2.bind(('', 0))
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>   OSError: [Errno 98] Address already in use
>
> Then, net.ipv4.ip_autobind_reuse needs to be enabled at some risk.
>
> bind()+connect() simply decreases the number of available 4-tuple on
> the netns unless all applications use bind()+connect() instead of just
> connect(), and it's unlikely.
>
>
> > (2) This patch will not hurt any users like in Google as an example
> > which prefers odd/even port selection, which is, I admit, indeed more
> > advanced.
>
> Indeed, it won't hurt existing users but will lead new users to the
> wrong way.

Wrong way? You claim providing an old way for users to prevent the
'breakage' of applications is wrong regardless of those users' habits?

>
>
> > (3) This patch does not come out of thin air, but from some users who I contact.
> > ?
>
> Is someone who contacted to you really aware of all of the above and
> even then in favor of bind() without IP_BIND_ADDRESS_NO_PORT ?
>
>
> > In my opinion, using and adjusting to the new algorithm needs some
> > changes in applications. For some old applications, they still need
> > more time to keep pace with a more workable solution.
>
> They will add setsockopt(IP_LOCAL_PORT_RANGE) whether your patch is
> applied or not, then, only thing they need to do is replace SO_REUSEADDR
> with IP_BIND_ADDRESS_NO_PORT, simple enough ?

From the perspective of kernel developers, yes, it can work for sure.

I'm not discussing why exactly those applications design like this or
not like what we expect, but the fact is they do exist. If I were a
network engineer, I could probably know how to avoid this, but I'm
not. If you insist on asking why they use it like this, I can only
make myself clear by introducing another similar story to help me.

My original way is introducing a sysctl knob which can be tuned by
users to select one of those two algorithms, but it's not acceptable
by the upstream team from what I know. Then, I have to resort to other
ways of letting people have a chance to go back before 2016.

We cannot deny that the connect() using new algo causes some problems
which were missed two or three years ago, which may be considered an
incredible thing now, the same thing could happen again.

Probably we are not on the same page. I'm not unfamiliar with the
'correct' way to handle the issue. Also, I'm not arguing that we
should give up the way you mentioned. The only thing I'm saying is
whether we should provide a method using an old algo or not.

That's all, that's my final try on this patch. If you both disagree,
I'm totally fine with it.

Thanks for both of you getting involved into this discussion. I really
appreciate it :)

Thanks,
Jason
Jason Xing Aug. 20, 2024, 9:22 a.m. UTC | #6
Hello Kuniyuki,

[...]
> > To be more concise, I would like to state 3 points to see if they are valid:
> > (1) Extending the option for bind() is the last puzzle of using an
> > older algorithm for some users. Since we have one in connect(), how
> > about adding it in bind() to provide for the people favouring the
> > older algorithm.
>
> Why do they want to use bind() to pick a random port in the first place ?
>

I feel sorry to bother you again.

Interesting thing is that I just found some of my personal records
that show to me: a lot of applications start active connections using
bind() to find a proper port in Google :) I'm not the only one :p
Probably coming up with the new algo selecting odd/even ports is the
reason/compromise for the bind() and non-bind() users. It gave range/2
for active flow using bind().

That's what I want to share with you :) Hope it will waste you and
Eric precious time :)

In my opinion, whatever the result is, technical communication is
important which can help the community grow. No hard feelings :)

Thanks,
Jason
Jason Xing Aug. 20, 2024, 9:24 a.m. UTC | #7
On Tue, Aug 20, 2024 at 5:22 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Kuniyuki,
>
> [...]
> > > To be more concise, I would like to state 3 points to see if they are valid:
> > > (1) Extending the option for bind() is the last puzzle of using an
> > > older algorithm for some users. Since we have one in connect(), how
> > > about adding it in bind() to provide for the people favouring the
> > > older algorithm.
> >
> > Why do they want to use bind() to pick a random port in the first place ?
> >
>
> I feel sorry to bother you again.
>
> Interesting thing is that I just found some of my personal records
> that show to me: a lot of applications start active connections using
> bind() to find a proper port in Google :) I'm not the only one :p
> Probably coming up with the new algo selecting odd/even ports is the
> reason/compromise for the bind() and non-bind() users. It gave range/2
> for active flow using bind().
>
> That's what I want to share with you :) Hope it will waste you and
> Eric precious time :)

Sorry, it should be "it won't"...
diff mbox series

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 64d07b842e73..6eb509832c90 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -333,19 +333,20 @@  inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
-	int i, low, high, attempt_half, port, l3mdev;
+	int i, low, high, attempt_half, port, l3mdev, step;
 	struct inet_bind_hashbucket *head, *head2;
+	bool local_ports, relax = false;
 	struct net *net = sock_net(sk);
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
-	bool relax = false;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
-	inet_sk_get_local_port_range(sk, &low, &high);
+	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
+	step = local_ports ? 1 : 2;
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	if (high - low < 4)
 		attempt_half = 0;
@@ -358,18 +359,19 @@  inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			low = half;
 	}
 	remaining = high - low;
-	if (likely(remaining > 1))
+	if (!local_ports && remaining > 1)
 		remaining &= ~1U;
 
 	offset = get_random_u32_below(remaining);
 	/* __inet_hash_connect() favors ports having @low parity
 	 * We do the opposite to not pollute connect() users.
 	 */
-	offset |= 1U;
+	if (!local_ports)
+		offset |= 1U;
 
 other_parity_scan:
 	port = low + offset;
-	for (i = 0; i < remaining; i += 2, port += 2) {
+	for (i = 0; i < remaining; i += step, port += step) {
 		if (unlikely(port >= high))
 			port -= remaining;
 		if (inet_is_local_reserved_port(net, port))
@@ -400,9 +402,11 @@  inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		cond_resched();
 	}
 
-	offset--;
-	if (!(offset & 1))
-		goto other_parity_scan;
+	if (!local_ports) {
+		offset--;
+		if (!(offset & 1))
+			goto other_parity_scan;
+	}
 
 	if (attempt_half == 1) {
 		/* OK we now try the upper half of the range */