diff mbox series

[RFC,net-next] net/smc: Unbind buffer size from clcsock and make it tunable

Message ID 20211122134255.63347-1-tonylu@linux.alibaba.com (mailing list archive)
State RFC
Headers show
Series [RFC,net-next] net/smc: Unbind buffer size from clcsock and make it tunable | expand

Commit Message

Tony Lu Nov. 22, 2021, 1:42 p.m. UTC
SMC uses smc->sk.sk_{rcv|snd}buf to create buffer for send buffer or
RMB. And the values of buffer size inherits from clcsock. The clcsock is
a TCP sock which is initiated during SMC connection startup.

The inherited buffer size doesn't fit SMC well. TCP provides two sysctl
knobs to tune r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC use the default
value from TCP. The buffer size is tuned for TCP, but not fit SMC well
in some scenarios. For example, we need larger buffer of SMC for high
throughput applications, and smaller buffer of SMC for saving contiguous
memory. We need to adjust the buffer size apart from TCP and not to
disturb TCP.

This unbinds buffer size which inherits from clcsock, and provides
sysctl knobs to adjust buffer size independently. These knobs can be
tuned with different values for different net namespaces for performance
and flexibility.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 Documentation/networking/smc-sysctl.rst | 20 ++++++
 include/net/netns/smc.h                 |  5 ++
 net/smc/Makefile                        |  2 +-
 net/smc/af_smc.c                        | 17 +++++-
 net/smc/smc_sysctl.c                    | 81 +++++++++++++++++++++++++
 net/smc/smc_sysctl.h                    | 22 +++++++
 6 files changed, 144 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/networking/smc-sysctl.rst
 create mode 100644 net/smc/smc_sysctl.c
 create mode 100644 net/smc/smc_sysctl.h

Comments

Karsten Graul Nov. 22, 2021, 3:08 p.m. UTC | #1
On 22/11/2021 14:42, Tony Lu wrote:
> SMC uses smc->sk.sk_{rcv|snd}buf to create buffer for send buffer or
> RMB. And the values of buffer size inherits from clcsock. The clcsock is
> a TCP sock which is initiated during SMC connection startup.
> 
> The inherited buffer size doesn't fit SMC well. TCP provides two sysctl
> knobs to tune r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC use the default
> value from TCP. The buffer size is tuned for TCP, but not fit SMC well
> in some scenarios. For example, we need larger buffer of SMC for high
> throughput applications, and smaller buffer of SMC for saving contiguous
> memory. We need to adjust the buffer size apart from TCP and not to
> disturb TCP.
> 
> This unbinds buffer size which inherits from clcsock, and provides
> sysctl knobs to adjust buffer size independently. These knobs can be
> tuned with different values for different net namespaces for performance
> and flexibility.
> 
> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
> Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
> ---

To activate SMC for existing programs usually the smc_run command or the
preload library (both from the smc-tools package) are used.
This commit introduced support to set the send and recv window sizes
using command line parameters or environment variables:

https://github.com/ibm-s390-linux/smc-tools/commit/59bfb99c588746f7dca1b3c97fd88f3f7cbc975f

Why another way to manipulate these sizes?
Your solution would stop applications to set these values.
Tony Lu Nov. 23, 2021, 6:56 a.m. UTC | #2
On Mon, Nov 22, 2021 at 04:08:37PM +0100, Karsten Graul wrote:
> On 22/11/2021 14:42, Tony Lu wrote:
> > SMC uses smc->sk.sk_{rcv|snd}buf to create buffer for send buffer or
> > RMB. And the values of buffer size inherits from clcsock. The clcsock is
> > a TCP sock which is initiated during SMC connection startup.
> > 
> > The inherited buffer size doesn't fit SMC well. TCP provides two sysctl
> > knobs to tune r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC use the default
> > value from TCP. The buffer size is tuned for TCP, but not fit SMC well
> > in some scenarios. For example, we need larger buffer of SMC for high
> > throughput applications, and smaller buffer of SMC for saving contiguous
> > memory. We need to adjust the buffer size apart from TCP and not to
> > disturb TCP.
> > 
> > This unbinds buffer size which inherits from clcsock, and provides
> > sysctl knobs to adjust buffer size independently. These knobs can be
> > tuned with different values for different net namespaces for performance
> > and flexibility.
> > 
> > Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
> > Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
> > ---
> 
> To activate SMC for existing programs usually the smc_run command or the
> preload library (both from the smc-tools package) are used.
> This commit introduced support to set the send and recv window sizes
> using command line parameters or environment variables:
> 
> https://github.com/ibm-s390-linux/smc-tools/commit/59bfb99c588746f7dca1b3c97fd88f3f7cbc975f

Hi Graul,

Thanks for your advice. We are using smc-tools, it is a very useful
tool, and we also use smc_run or LD_PRELOAD to help our applications to
replace with SMC from TCP.

There are some differences to use SMC in our environment. The followings
are our scenarios to use SMC:

1. Transparent acceleration

This approach is widely used in our environment. The main idea of
transparent acceleration is not to touch the applications. The
applications are usually pre-compiled and pre-packaged containers or
ECS, which means the binary and the binary needed environments, like
glibc and other libraries are bundled as bootstrap containers. So it is
hard to inject the smc_run or LD_PRELOAD into application's containers
runtime. 

To solve this issue, we developed a set of patches to replace the
AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration.
So that we can control acceleration in kernel without any other changes
in user-space, and won't break our application containers and publish
workflow. These patches are still improving for upstream.

2. Use SMC explicitly

This approach is very straightforward. Applications just create sockets
using AF_SMC and SMCPROTO_SMC{6}, and SMC works fine. 

However, most of applications don't want to bind tightly to single SMC
protocol. Most of them take into account compatibility, stability and
flexibility.

> Why another way to manipulate these sizes?
> Your solution would stop applications to set these values.

I didn't understand it clearly about stopping applications to set there
values.

IMHO, this RFC introduces two knobs for snd/rcvbuf. During the following
stages, applications can set these values as expected.

1. SMC module or per-net-namespace initialized:

sysctl_{w|r}mem_default initialized in smc_net_init() when current
net namespace initialized. The default values of SMC inherit from TCP,
and clcsock use TCP's configures.

2. create SMC socket:

smc_create() is called, and smc->sk.sk_{snd|rcv}buf are initialized from
per-netns earlier. There are no different from before. Except for
changing the values of TCP after SMC initialized, users can change them
with newly added two knobs.

3. applications call setsockopt() to modify SO_SNDBUF or SO_RCVBUF:

After smc_create() creates socket, applications can use setsockopt() to
change the snd/rcvbuf, which called sock_setsockopt() directly. If
fallback happened, setsockopt() would change clcsock's values.


In the end, we hope to provide a flexibility approach to change
SMC's buffer size only and don't disturb others. Sysctl are considered
as a better way to maintain and easy to use for users.

Thanks,
Tony Lu
Karsten Graul Nov. 23, 2021, 9:33 a.m. UTC | #3
On 23/11/2021 07:56, Tony Lu wrote:
> To solve this issue, we developed a set of patches to replace the
> AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration.
> So that we can control acceleration in kernel without any other changes
> in user-space, and won't break our application containers and publish
> workflow. These patches are still improving for upstream.

This sounds interesting. Will this also be namespace-based like the sysctls
in the current patch? Will this future change integrate nicely with the current
new sysctls? This might allow to activate smc for containers, selectively. 

Please send these changes in a patch series together when they are finished.
We would like to review them as a whole to see how things play together.

Thank you!
Tony Lu Nov. 24, 2021, 4:46 a.m. UTC | #4
On Tue, Nov 23, 2021 at 10:33:07AM +0100, Karsten Graul wrote:
> On 23/11/2021 07:56, Tony Lu wrote:
> > To solve this issue, we developed a set of patches to replace the
> > AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration.
> > So that we can control acceleration in kernel without any other changes
> > in user-space, and won't break our application containers and publish
> > workflow. These patches are still improving for upstream.
> 
> This sounds interesting. Will this also be namespace-based like the sysctls
> in the current patch? Will this future change integrate nicely with the current
> new sysctls? This might allow to activate smc for containers, selectively. 
> 
> Please send these changes in a patch series together when they are finished.
> We would like to review them as a whole to see how things play together.
> 
> Thank you!

Hi Graul,

I am glad to hear that. The container, which is isolated by
net-namespace, is the minimal deployment unit in our environment. The
transparent replacement facility is namespace-based, so that we can
control the applications behaviors according to the dimensions of
containers.

The per-netns sysctl is the first step, control the applications'
buffer, and not to disturb TCP connections in the same container. The
container's memory is not as large as that of a physical machine, and
containers might run different workload applications, so we use
per-netns sysctl to adjust buffer. And it can be sent out separately.

Then, we can allow to activate SMC for some containers with transparent
replacement. These patches are improving, make sure the flexible enough
for containers and applications scopes, and cohesion enough for
upstream.

I will send them out as the containers solutions. Thank for you advice.

Thanks,
Tony Lu
diff mbox series

Patch

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..ba2be59a57dd
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,20 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+SMC Sysctl
+=========
+
+/proc/sys/net/smc/* Variables
+==============================
+
+wmem_default - INTEGER
+    Initial size of send buffer used by SMC sockets.
+    The default value inherits from net.ipv4.tcp_wmem[1].
+
+    Default: 16K
+
+rmem_default - INTEGER
+    Initial size of receive buffer (RMB) used by SMC sockets.
+    The default value inherits from net.ipv4.tcp_rmem[1].
+
+    Default: 131072 bytes.
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index ea8a9cf2619b..f948235e3156 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -12,5 +12,10 @@  struct netns_smc {
 	/* protect fback_rsn */
 	struct mutex			mutex_fback_rsn;
 	struct smc_stats_rsn		*fback_rsn;
+#ifdef CONFIG_SYSCTL
+	struct ctl_table_header		*smc_hdr;
+#endif
+	int				sysctl_wmem_default;
+	int				sysctl_rmem_default;
 };
 #endif
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 196fb6f01b14..640af9a39f9c 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,4 +4,4 @@  obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
 smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 5e93a5aa347e..543a6180be1d 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -51,6 +51,7 @@ 
 #include "smc_close.h"
 #include "smc_stats.h"
 #include "smc_tracepoint.h"
+#include "smc_sysctl.h"
 
 static DEFINE_MUTEX(smc_server_lgr_pending);	/* serialize link group
 						 * creation on server
@@ -2722,8 +2723,8 @@  static int smc_create(struct net *net, struct socket *sock, int protocol,
 		sk_common_release(sk);
 		goto out;
 	}
-	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
-	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
+	smc->sk.sk_sndbuf = sock_net(sk)->smc.sysctl_wmem_default;
+	smc->sk.sk_rcvbuf = sock_net(sk)->smc.sysctl_rmem_default;
 
 out:
 	return rc;
@@ -2739,6 +2740,11 @@  unsigned int smc_net_id;
 
 static __net_init int smc_net_init(struct net *net)
 {
+	net->smc.sysctl_wmem_default = max(net->ipv4.sysctl_tcp_wmem[1],
+					   SMC_BUF_MIN_SIZE);
+	net->smc.sysctl_rmem_default = max(net->ipv4.sysctl_tcp_rmem[1],
+					   SMC_BUF_MIN_SIZE);
+
 	return smc_pnet_net_init(net);
 }
 
@@ -2845,6 +2851,12 @@  static int __init smc_init(void)
 		goto out_sock;
 	}
 
+	rc = smc_sysctl_init();
+	if (rc) {
+		pr_err("%s: sysctl fails with %d\n", __func__, rc);
+		goto out_sock;
+	}
+
 	static_branch_enable(&tcp_have_smc);
 	return 0;
 
@@ -2885,6 +2897,7 @@  static void __exit smc_exit(void)
 	smc_clc_exit();
 	unregister_pernet_subsys(&smc_net_stat_ops);
 	unregister_pernet_subsys(&smc_net_ops);
+	smc_sysctl_exit();
 	rcu_barrier();
 }
 
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
new file mode 100644
index 000000000000..6706fe1bd888
--- /dev/null
+++ b/net/smc/smc_sysctl.c
@@ -0,0 +1,81 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sysctl.h>
+#include <net/sock.h>
+#include <net/net_namespace.h>
+
+#include "smc_core.h"
+
+static int min_sndbuf = SMC_BUF_MIN_SIZE;
+static int min_rcvbuf = SMC_BUF_MIN_SIZE;
+
+static struct ctl_table smc_table[] = {
+	{
+		.procname	= "wmem_default",
+		.data		= &init_net.smc.sysctl_wmem_default,
+		.maxlen		= sizeof(init_net.smc.sysctl_wmem_default),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_sndbuf,
+	},
+	{
+		.procname	= "rmem_default",
+		.data		= &init_net.smc.sysctl_rmem_default,
+		.maxlen		= sizeof(init_net.smc.sysctl_rmem_default),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_rcvbuf,
+	},
+	{  }
+};
+
+static __net_init int smc_sysctl_init_net(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = smc_table;
+	if (!net_eq(net, &init_net)) {
+		int i;
+
+		table = kmemdup(table, sizeof(smc_table), GFP_KERNEL);
+		if (!table)
+			goto err_alloc;
+
+		for (i = 0; i < ARRAY_SIZE(smc_table) - 1; i++)
+			table[i].data += (void *)net - (void *)&init_net;
+	}
+
+	net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
+	if (!net->smc.smc_hdr)
+		goto err_reg;
+
+	return 0;
+
+err_reg:
+	if (!net_eq(net, &init_net))
+		kfree(table);
+err_alloc:
+	return -ENOMEM;
+}
+
+static __net_exit void smc_sysctl_exit_net(struct net *net)
+{
+	unregister_net_sysctl_table(net->smc.smc_hdr);
+}
+
+static struct pernet_operations smc_sysctl_ops __net_initdata = {
+	.init = smc_sysctl_init_net,
+	.exit = smc_sysctl_exit_net,
+};
+
+int __init smc_sysctl_init(void)
+{
+	return register_pernet_subsys(&smc_sysctl_ops);
+}
+
+void smc_sysctl_exit(void)
+{
+	unregister_pernet_subsys(&smc_sysctl_ops);
+}
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
new file mode 100644
index 000000000000..c01c5de3a3ea
--- /dev/null
+++ b/net/smc/smc_sysctl.h
@@ -0,0 +1,22 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _SMC_SYSCTL_H
+#define _SMC_SYSCTL_H
+
+#ifdef CONFIG_SYSCTL
+
+int smc_sysctl_init(void);
+void smc_sysctl_exit(void);
+
+#else
+
+int smc_sysctl_init(void)
+{
+	return 0;
+}
+
+void smc_sysctl_exit(void) { }
+
+#endif /* CONFIG_SYSCTL */
+
+#endif /* _SMC_SYSCTL_H */