diff mbox series

[v5,bpf-next,01/11] net: Introduce net.ipv4.tcp_migrate_req.

Message ID 20210510034433.52818-2-kuniyu@amazon.co.jp (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series Socket migration for SO_REUSEPORT. | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 12 maintainers not CCed: dsahern@kernel.org pablo@netfilter.org yhs@fb.com kpsingh@kernel.org yoshfuji@linux-ipv6.org andreas.a.roeseler@gmail.com john.fastabend@gmail.com linux-doc@vger.kernel.org corbet@lwn.net songliubraving@fb.com fw@strlen.de amcohen@nvidia.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 6699 this patch: 6699
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 48 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 6912 this patch: 6912
netdev/header_inline success Link

Commit Message

Iwashima, Kuniyuki May 10, 2021, 3:44 a.m. UTC
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or eBPF program is attached, we will be able to migrate
child sockets from a listener to another in the same reuseport group after
close() or shutdown() syscalls.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 20 ++++++++++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 3 files changed, 30 insertions(+)

Comments

Martin KaFai Lau May 15, 2021, 12:47 a.m. UTC | #1
On Mon, May 10, 2021 at 12:44:23PM +0900, Kuniyuki Iwashima wrote:
> This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
> option is enabled or eBPF program is attached, we will be able to migrate
> child sockets from a listener to another in the same reuseport group after
> close() or shutdown() syscalls.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 20 ++++++++++++++++++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
>  3 files changed, 30 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index c2ecc9894fd0..8e92f9b28aad 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -732,6 +732,26 @@ tcp_syncookies - INTEGER
>  	network connections you can set this knob to 2 to enable
>  	unconditionally generation of syncookies.
>  
> +tcp_migrate_req - INTEGER
> +	The incoming connection is tied to a specific listening socket when
> +	the initial SYN packet is received during the three-way handshake.
> +	When a listener is closed, in-flight request sockets during the
> +	handshake and established sockets in the accept queue are aborted.
> +
> +	If the listener has SO_REUSEPORT enabled, other listeners on the
> +	same port should have been able to accept such connections. This
> +	option makes it possible to migrate such child sockets to another
> +	listener after close() or shutdown().
> +
> +	Default: 0
> +
> +	Note that the source and destination listeners MUST have the same
> +	settings at the socket API level. If different applications listen
It is a bit confusing on what "source and destination listeners" and
"same settings at the socket API level" mean.

Does it mean to say a bpf prog should usually be used to define the policy
to pick an alive listener.  If bpf prog is absence, the kernel will
randomly pick an alive listener only if this sysctl is enabled?

Others lgtm.

Acked-by: Martin KaFai Lau <kafai@fb.com>

> +	on the same port, disable this option or attach the
> +	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program to select
> +	the correct socket by bpf_sk_select_reuseport() or to cancel
> +	migration by returning SK_DROP.
> +
Iwashima, Kuniyuki May 15, 2021, 4:01 a.m. UTC | #2
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Fri, 14 May 2021 17:47:20 -0700
> On Mon, May 10, 2021 at 12:44:23PM +0900, Kuniyuki Iwashima wrote:
> > This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
> > option is enabled or eBPF program is attached, we will be able to migrate
> > child sockets from a listener to another in the same reuseport group after
> > close() or shutdown() syscalls.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > ---
> >  Documentation/networking/ip-sysctl.rst | 20 ++++++++++++++++++++
> >  include/net/netns/ipv4.h               |  1 +
> >  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
> >  3 files changed, 30 insertions(+)
> > 
> > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> > index c2ecc9894fd0..8e92f9b28aad 100644
> > --- a/Documentation/networking/ip-sysctl.rst
> > +++ b/Documentation/networking/ip-sysctl.rst
> > @@ -732,6 +732,26 @@ tcp_syncookies - INTEGER
> >  	network connections you can set this knob to 2 to enable
> >  	unconditionally generation of syncookies.
> >  
> > +tcp_migrate_req - INTEGER
> > +	The incoming connection is tied to a specific listening socket when
> > +	the initial SYN packet is received during the three-way handshake.
> > +	When a listener is closed, in-flight request sockets during the
> > +	handshake and established sockets in the accept queue are aborted.
> > +
> > +	If the listener has SO_REUSEPORT enabled, other listeners on the
> > +	same port should have been able to accept such connections. This
> > +	option makes it possible to migrate such child sockets to another
> > +	listener after close() or shutdown().
> > +
> > +	Default: 0
> > +
> > +	Note that the source and destination listeners MUST have the same
> > +	settings at the socket API level. If different applications listen
> It is a bit confusing on what "source and destination listeners" and
> "same settings at the socket API level" mean.
> 
> Does it mean to say a bpf prog should usually be used to define the policy
> to pick an alive listener.  If bpf prog is absence, the kernel will
> randomly pick an alive listener only if this sysctl is enabled?

Yes.

If there are two listeners having different setsockopt() settings and no
ebpf prog is attached, randam pick may crash applications.

Let's say, the migration happens from listener A to B, and only B has
TCP_SAVE_SYN enabled. Then B cannot read SYN from some requests migrated
from A.

I've written this in commit log in v2, but somehow dropped in v3...
https://lore.kernel.org/netdev/20201207132456.65472-7-kuniyu@amazon.co.jp/

I will change the description more specific.


> 
> Others lgtm.
> 
> Acked-by: Martin KaFai Lau <kafai@fb.com>

Thank you!


> 
> > +	on the same port, disable this option or attach the
> > +	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program to select
> > +	the correct socket by bpf_sk_select_reuseport() or to cancel
> > +	migration by returning SK_DROP.
> > +
diff mbox series

Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index c2ecc9894fd0..8e92f9b28aad 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -732,6 +732,26 @@  tcp_syncookies - INTEGER
 	network connections you can set this knob to 2 to enable
 	unconditionally generation of syncookies.
 
+tcp_migrate_req - INTEGER
+	The incoming connection is tied to a specific listening socket when
+	the initial SYN packet is received during the three-way handshake.
+	When a listener is closed, in-flight request sockets during the
+	handshake and established sockets in the accept queue are aborted.
+
+	If the listener has SO_REUSEPORT enabled, other listeners on the
+	same port should have been able to accept such connections. This
+	option makes it possible to migrate such child sockets to another
+	listener after close() or shutdown().
+
+	Default: 0
+
+	Note that the source and destination listeners MUST have the same
+	settings at the socket API level. If different applications listen
+	on the same port, disable this option or attach the
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program to select
+	the correct socket by bpf_sk_select_reuseport() or to cancel
+	migration by returning SK_DROP.
+
 tcp_fastopen - INTEGER
 	Enable TCP Fast Open (RFC7413) to send and accept data in the opening
 	SYN packet.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index f6af8d96d3c6..fd63c91addcc 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -126,6 +126,7 @@  struct netns_ipv4 {
 	u8 sysctl_tcp_syn_retries;
 	u8 sysctl_tcp_synack_retries;
 	u8 sysctl_tcp_syncookies;
+	u8 sysctl_tcp_migrate_req;
 	int sysctl_tcp_reordering;
 	u8 sysctl_tcp_retries1;
 	u8 sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a62934b9f15a..7bb013fcbf5f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -940,6 +940,15 @@  static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dou8vec_minmax,
 	},
 #endif
+	{
+		.procname	= "tcp_migrate_req",
+		.data		= &init_net.ipv4.sysctl_tcp_migrate_req,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 	{
 		.procname	= "tcp_reordering",
 		.data		= &init_net.ipv4.sysctl_tcp_reordering,