Message ID | 20240816153204.93787-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] tcp: change source port selection at bind() time | expand |
On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp: > change source port selection at connect() time"). > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we > don't need to iterate every two ports which means only favouring odd > number like the old days before 2016, which can be good for some > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in > connect(). Except that bind() with a port reservation is not as common as a connect(). This is highly discouraged. See IP_BIND_ADDRESS_NO_PORT Can you provide a real use case ? I really feel like you are trying to push patches 'just because you can'... 'The old days' before 2016 were not very nice, we had P0 all the time because of port exhaustion. Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories.
On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp: > > change source port selection at connect() time"). > > > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we > > don't need to iterate every two ports which means only favouring odd > > number like the old days before 2016, which can be good for some > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in > > connect(). > > Except that bind() with a port reservation is not as common as a connect(). > This is highly discouraged. > > See IP_BIND_ADDRESS_NO_PORT > > Can you provide a real use case ? > > I really feel like you are trying to push patches 'just because you can'... You apparently got me wrong and hurt my feelings :/ Since you asked me about this one, I'm going to tell you the whole story. A few years ago, one of my colleagues reached out to you and complained about the issue of the new port selection algorithm (odd/even port selection) to you. But a few years later, the community finally implemented/extended such IP_LOCAL_PORT_RANGE option to support finding a suitable port one by one. I'm not sure if you remembered. Nearly every month I get issue reports about why the latest kernel applies a new algorithm because of the high cpu-load increasing so much suddenly, then I will let them use the older one which I use a sysctl knob to control internally. If you pay more attention in these years, you've received more than one patch trying to use the older algorithm provided for users, even though they are not accepted. You may ignore the fact. For me, I just want to provide a way/patch to use the older algorithm for connect() and bind(), which can be accepted by the community relatively easily. That's all. I don't believe we are the only special one! So I want to change something! Like that 'self-connection' issue, we are also not the only one, so I exert/push myself to solve the issue thoroughly. To be honest, whether you accept the patch, I cannot control it, but I will try my best to do some useful reports and provide real use cases which are derived from our production. That's not bad, right?
Hello Eric, On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp: > > change source port selection at connect() time"). > > > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we > > don't need to iterate every two ports which means only favouring odd > > number like the old days before 2016, which can be good for some > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in > > connect(). > > Except that bind() with a port reservation is not as common as a connect(). > This is highly discouraged. > > See IP_BIND_ADDRESS_NO_PORT > > Can you provide a real use case ? > > I really feel like you are trying to push patches 'just because you can'... > > 'The old days' before 2016 were not very nice, we had P0 all the time > because of port exhaustion. > Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories. As you mentioned last night, the issues happening in connect() are relatively more than in bind(). To be more concise, I would like to state 3 points to see if they are valid: (1) Extending the option for bind() is the last puzzle of using an older algorithm for some users. Since we have one in connect(), how about adding it in bind() to provide for the people favouring the older algorithm. (2) This patch will not hurt any users like in Google as an example which prefers odd/even port selection, which is, I admit, indeed more advanced. (3) This patch does not come out of thin air, but from some users who I contact. ? In my opinion, using and adjusting to the new algorithm needs some changes in applications. For some old applications, they still need more time to keep pace with a more workable solution. After considering it a whole night, I would like to push this tiny feature into the upstream kernel, I wonder if you can help me review it? Thanks in advance, Eric. Thanks, Jason
From: Jason Xing <kerneljasonxing@gmail.com> Date: Tue, 20 Aug 2024 08:53:53 +0800 > Hello Eric, > > On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp: > > > change source port selection at connect() time"). > > > > > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we > > > don't need to iterate every two ports which means only favouring odd > > > number like the old days before 2016, which can be good for some > > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in > > > connect(). > > > > Except that bind() with a port reservation is not as common as a connect(). > > This is highly discouraged. > > > > See IP_BIND_ADDRESS_NO_PORT > > > > Can you provide a real use case ? > > > > I really feel like you are trying to push patches 'just because you can'... > > > > 'The old days' before 2016 were not very nice, we had P0 all the time > > because of port exhaustion. > > Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories. > > As you mentioned last night, the issues happening in connect() are > relatively more than in bind(). > > To be more concise, I would like to state 3 points to see if they are valid: > (1) Extending the option for bind() is the last puzzle of using an > older algorithm for some users. Since we have one in connect(), how > about adding it in bind() to provide for the people favouring the > older algorithm. Why do they want to use bind() to pick a random port in the first place ? bind() behaviour is not strictly the same with connect(); the port reserved by bind() is not reusable for connect(). Also, bind() requires SO_REUSEADDR to share a port, but by default, even SO_REUSEADDR enabled sockets cannot share the same port if application uses random-pick by bind((IP, 0)): # sysctl -w net.ipv4.ip_local_port_range="32768 32768" # python3 >>> from socket import * >>> >>> c1 = socket() >>> c1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) >>> c1.bind(('', 0)) >>> c1 <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 32768)> >>> >>> c2 = socket() >>> c2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) >>> c2.bind(('', 0)) Traceback (most recent call last): File "<stdin>", line 1, in <module> OSError: [Errno 98] Address already in use Then, net.ipv4.ip_autobind_reuse needs to be enabled at some risk. bind()+connect() simply decreases the number of available 4-tuple on the netns unless all applications use bind()+connect() instead of just connect(), and it's unlikely. > (2) This patch will not hurt any users like in Google as an example > which prefers odd/even port selection, which is, I admit, indeed more > advanced. Indeed, it won't hurt existing users but will lead new users to the wrong way. > (3) This patch does not come out of thin air, but from some users who I contact. > ? Is someone who contacted to you really aware of all of the above and even then in favor of bind() without IP_BIND_ADDRESS_NO_PORT ? > In my opinion, using and adjusting to the new algorithm needs some > changes in applications. For some old applications, they still need > more time to keep pace with a more workable solution. They will add setsockopt(IP_LOCAL_PORT_RANGE) whether your patch is applied or not, then, only thing they need to do is replace SO_REUSEADDR with IP_BIND_ADDRESS_NO_PORT, simple enough ? > After considering it a whole night, I would like to push this tiny > feature into the upstream kernel, I wonder if you can help me review > it? Thanks in advance, Eric.
On Tue, Aug 20, 2024 at 12:53 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Jason Xing <kerneljasonxing@gmail.com> > Date: Tue, 20 Aug 2024 08:53:53 +0800 > > Hello Eric, > > > > On Mon, Aug 19, 2024 at 11:45 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Fri, Aug 16, 2024 at 5:33 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > This is a follow-up patch to an eariler commit 207184853dbd ("tcp/dccp: > > > > change source port selection at connect() time"). > > > > > > > > This patch extends the use of IP_LOCAL_PORT_RANGE option, so that we > > > > don't need to iterate every two ports which means only favouring odd > > > > number like the old days before 2016, which can be good for some > > > > users who want to keep in consistency with IP_LOCAL_PORT_RANGE in > > > > connect(). > > > > > > Except that bind() with a port reservation is not as common as a connect(). > > > This is highly discouraged. > > > > > > See IP_BIND_ADDRESS_NO_PORT > > > > > > Can you provide a real use case ? > > > > > > I really feel like you are trying to push patches 'just because you can'... > > > > > > 'The old days' before 2016 were not very nice, we had P0 all the time > > > because of port exhaustion. > > > Since 2016 and IP_BIND_ADDRESS_NO_PORT I no longer have war rooms stories. > > > > As you mentioned last night, the issues happening in connect() are > > relatively more than in bind(). > > > > To be more concise, I would like to state 3 points to see if they are valid: > > (1) Extending the option for bind() is the last puzzle of using an > > older algorithm for some users. Since we have one in connect(), how > > about adding it in bind() to provide for the people favouring the > > older algorithm. > > Why do they want to use bind() to pick a random port in the first place ? > > bind() behaviour is not strictly the same with connect(); the port reserved > by bind() is not reusable for connect(). > > Also, bind() requires SO_REUSEADDR to share a port, but by default, even > SO_REUSEADDR enabled sockets cannot share the same port if application > uses random-pick by bind((IP, 0)): > > # sysctl -w net.ipv4.ip_local_port_range="32768 32768" > # python3 > >>> from socket import * > >>> > >>> c1 = socket() > >>> c1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) > >>> c1.bind(('', 0)) > >>> c1 > <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 32768)> > >>> > >>> c2 = socket() > >>> c2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) > >>> c2.bind(('', 0)) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > OSError: [Errno 98] Address already in use > > Then, net.ipv4.ip_autobind_reuse needs to be enabled at some risk. > > bind()+connect() simply decreases the number of available 4-tuple on > the netns unless all applications use bind()+connect() instead of just > connect(), and it's unlikely. > > > > (2) This patch will not hurt any users like in Google as an example > > which prefers odd/even port selection, which is, I admit, indeed more > > advanced. > > Indeed, it won't hurt existing users but will lead new users to the > wrong way. Wrong way? You claim providing an old way for users to prevent the 'breakage' of applications is wrong regardless of those users' habits? > > > > (3) This patch does not come out of thin air, but from some users who I contact. > > ? > > Is someone who contacted to you really aware of all of the above and > even then in favor of bind() without IP_BIND_ADDRESS_NO_PORT ? > > > > In my opinion, using and adjusting to the new algorithm needs some > > changes in applications. For some old applications, they still need > > more time to keep pace with a more workable solution. > > They will add setsockopt(IP_LOCAL_PORT_RANGE) whether your patch is > applied or not, then, only thing they need to do is replace SO_REUSEADDR > with IP_BIND_ADDRESS_NO_PORT, simple enough ? From the perspective of kernel developers, yes, it can work for sure. I'm not discussing why exactly those applications design like this or not like what we expect, but the fact is they do exist. If I were a network engineer, I could probably know how to avoid this, but I'm not. If you insist on asking why they use it like this, I can only make myself clear by introducing another similar story to help me. My original way is introducing a sysctl knob which can be tuned by users to select one of those two algorithms, but it's not acceptable by the upstream team from what I know. Then, I have to resort to other ways of letting people have a chance to go back before 2016. We cannot deny that the connect() using new algo causes some problems which were missed two or three years ago, which may be considered an incredible thing now, the same thing could happen again. Probably we are not on the same page. I'm not unfamiliar with the 'correct' way to handle the issue. Also, I'm not arguing that we should give up the way you mentioned. The only thing I'm saying is whether we should provide a method using an old algo or not. That's all, that's my final try on this patch. If you both disagree, I'm totally fine with it. Thanks for both of you getting involved into this discussion. I really appreciate it :) Thanks, Jason
Hello Kuniyuki, [...] > > To be more concise, I would like to state 3 points to see if they are valid: > > (1) Extending the option for bind() is the last puzzle of using an > > older algorithm for some users. Since we have one in connect(), how > > about adding it in bind() to provide for the people favouring the > > older algorithm. > > Why do they want to use bind() to pick a random port in the first place ? > I feel sorry to bother you again. Interesting thing is that I just found some of my personal records that show to me: a lot of applications start active connections using bind() to find a proper port in Google :) I'm not the only one :p Probably coming up with the new algo selecting odd/even ports is the reason/compromise for the bind() and non-bind() users. It gave range/2 for active flow using bind(). That's what I want to share with you :) Hope it will waste you and Eric precious time :) In my opinion, whatever the result is, technical communication is important which can help the community grow. No hard feelings :) Thanks, Jason
On Tue, Aug 20, 2024 at 5:22 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > Hello Kuniyuki, > > [...] > > > To be more concise, I would like to state 3 points to see if they are valid: > > > (1) Extending the option for bind() is the last puzzle of using an > > > older algorithm for some users. Since we have one in connect(), how > > > about adding it in bind() to provide for the people favouring the > > > older algorithm. > > > > Why do they want to use bind() to pick a random port in the first place ? > > > > I feel sorry to bother you again. > > Interesting thing is that I just found some of my personal records > that show to me: a lot of applications start active connections using > bind() to find a proper port in Google :) I'm not the only one :p > Probably coming up with the new algo selecting odd/even ports is the > reason/compromise for the bind() and non-bind() users. It gave range/2 > for active flow using bind(). > > That's what I want to share with you :) Hope it will waste you and > Eric precious time :) Sorry, it should be "it won't"...
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 64d07b842e73..6eb509832c90 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -333,19 +333,20 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret, struct inet_bind_hashbucket **head2_ret, int *port_ret) { struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk); - int i, low, high, attempt_half, port, l3mdev; + int i, low, high, attempt_half, port, l3mdev, step; struct inet_bind_hashbucket *head, *head2; + bool local_ports, relax = false; struct net *net = sock_net(sk); struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; u32 remaining, offset; - bool relax = false; l3mdev = inet_sk_bound_l3mdev(sk); ports_exhausted: attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0; other_half_scan: - inet_sk_get_local_port_range(sk, &low, &high); + local_ports = inet_sk_get_local_port_range(sk, &low, &high); + step = local_ports ? 1 : 2; high++; /* [32768, 60999] -> [32768, 61000[ */ if (high - low < 4) attempt_half = 0; @@ -358,18 +359,19 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret, low = half; } remaining = high - low; - if (likely(remaining > 1)) + if (!local_ports && remaining > 1) remaining &= ~1U; offset = get_random_u32_below(remaining); /* __inet_hash_connect() favors ports having @low parity * We do the opposite to not pollute connect() users. */ - offset |= 1U; + if (!local_ports) + offset |= 1U; other_parity_scan: port = low + offset; - for (i = 0; i < remaining; i += 2, port += 2) { + for (i = 0; i < remaining; i += step, port += step) { if (unlikely(port >= high)) port -= remaining; if (inet_is_local_reserved_port(net, port)) @@ -400,9 +402,11 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret, cond_resched(); } - offset--; - if (!(offset & 1)) - goto other_parity_scan; + if (!local_ports) { + offset--; + if (!(offset & 1)) + goto other_parity_scan; + } if (attempt_half == 1) { /* OK we now try the upper half of the range */