diff mbox series

[net-next] ipvs: allow some sysctls in non-init user namespaces

Message ID 20240416144814.173185-1-aleksandr.mikhalitsyn@canonical.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [net-next] ipvs: allow some sysctls in non-init user namespaces | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 927 this patch: 927
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 4 maintainers not CCed: pabeni@redhat.com edumazet@google.com coreteam@netfilter.org kuba@kernel.org
netdev/build_clang success Errors and warnings before: 937 this patch: 937
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 938 this patch: 938
netdev/checkpatch warning WARNING: line length of 101 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-16--21-00 (tests: 958)

Commit Message

Aleksandr Mikhalitsyn April 16, 2024, 2:48 p.m. UTC
Let's make all IPVS sysctls visible and RO even when
network namespace is owned by non-initial user namespace.

Let's make a few sysctls to be writable:
- conntrack
- conn_reuse_mode
- expire_nodest_conn
- expire_quiescent_template

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This list of sysctls was chosen because I can't
see any security risks allowing them and also
Kubernetes uses [2] these specific sysctls.

This patch is motivated by user request in the LXC
project [1].

[1] https://github.com/lxc/lxc/issues/4278
[2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103

Cc: Stéphane Graber <stgraber@stgraber.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 net/netfilter/ipvs/ip_vs_ctl.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Julian Anastasov April 17, 2024, 1:02 p.m. UTC | #1
Hello,

On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls visible and RO even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be writable:
> - conntrack
> - conn_reuse_mode
> - expire_nodest_conn
> - expire_quiescent_template
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This list of sysctls was chosen because I can't
> see any security risks allowing them and also
> Kubernetes uses [2] these specific sysctls.
> 
> This patch is motivated by user request in the LXC
> project [1].
> 
> [1] https://github.com/lxc/lxc/issues/4278
> [2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> 
> Cc: Stéphane Graber <stgraber@stgraber.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
> Cc: Florian Westphal <fw@strlen.de>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..92a818c2f783 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4285,10 +4285,22 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)

	As the list of privileged vars is short I prefer
to use a bool and to make only some vars read-only:

	bool unpriv = false;

>  		if (tbl == NULL)
>  			return -ENOMEM;
>  
> -		/* Don't export sysctls to unprivileged users */
> +		/* Let's show all sysctls in non-init user namespace-owned
> +		 * net namespaces, but make them read-only.
> +		 *
> +		 * Allow only a few specific sysctls to be writable.
> +		 */
>  		if (net->user_ns != &init_user_ns) {

	Here we should just set: unpriv = true;

> -			tbl[0].procname = NULL;
> -			ctl_table_size = 0;
> +			for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> +				if (!tbl[idx].procname)
> +					continue;
> +
> +				if (!((strcmp(tbl[idx].procname, "conntrack") == 0) ||
> +				      (strcmp(tbl[idx].procname, "conn_reuse_mode") == 0) ||
> +				      (strcmp(tbl[idx].procname, "expire_nodest_conn") == 0) ||
> +				      (strcmp(tbl[idx].procname, "expire_quiescent_template") == 0)))
> +					tbl[idx].mode = 0444;
> +			}
>  		}
>  	} else
>  		tbl = vs_vars;

	And below at every place to use:

	if (unpriv)
		tbl[idx].mode = 0444;

	for the following 4 privileged sysctl vars:

- sync_qlen_max:
	- allocates messages in kernel context
	- this needs better tunning in another patch

- sync_sock_size:
	- allocates messages in kernel context

- run_estimation:
	- for now, better init ns to decide if to use est stats

- est_nice:
	- for now, better init ns to decide the value

- debug_level:
	- already set to 0444

	I.e. these vars allocate resources (mem, CPU) without
proper control, so for now we will just copy them from init ns
without allowing writing. And they are vars that are not tuned
often. Also we do not know which netns is supposed to be the
privileged one, some solutions move all devices out of init_net,
so we can not decide where to use lower limits.

	OTOH, "amemthresh" is not privileged but needs single READ_ONCE 
for sysctl_amemthresh in update_defense_level() due to the possible
div by zero if we allow writing to anyone, eg.:

	int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
	...
	nomem = availmem < amemthresh;
	... use only amemthresh

	All other vars can be writable.

Regards

--
Julian Anastasov <ja@ssi.bg>
Aleksandr Mikhalitsyn April 18, 2024, 11:05 a.m. UTC | #2
On Wed, Apr 17, 2024 at 3:02 PM Julian Anastasov <ja@ssi.bg> wrote:
>
>
>         Hello,

Dear Julian,

>
> On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:
>
> > Let's make all IPVS sysctls visible and RO even when
> > network namespace is owned by non-initial user namespace.
> >
> > Let's make a few sysctls to be writable:
> > - conntrack
> > - conn_reuse_mode
> > - expire_nodest_conn
> > - expire_quiescent_template
> >
> > I'm trying to be conservative with this to prevent
> > introducing any security issues in there. Maybe,
> > we can allow more sysctls to be writable, but let's
> > do this on-demand and when we see real use-case.
> >
> > This list of sysctls was chosen because I can't
> > see any security risks allowing them and also
> > Kubernetes uses [2] these specific sysctls.
> >
> > This patch is motivated by user request in the LXC
> > project [1].
> >
> > [1] https://github.com/lxc/lxc/issues/4278
> > [2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> >
> > Cc: Stéphane Graber <stgraber@stgraber.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Julian Anastasov <ja@ssi.bg>
> > Cc: Simon Horman <horms@verge.net.au>
> > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
> > Cc: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++++++++++++++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 143a341bbc0a..92a818c2f783 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -4285,10 +4285,22 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>
>         As the list of privileged vars is short I prefer
> to use a bool and to make only some vars read-only:
>
>         bool unpriv = false;
>
> >               if (tbl == NULL)
> >                       return -ENOMEM;
> >
> > -             /* Don't export sysctls to unprivileged users */
> > +             /* Let's show all sysctls in non-init user namespace-owned
> > +              * net namespaces, but make them read-only.
> > +              *
> > +              * Allow only a few specific sysctls to be writable.
> > +              */
> >               if (net->user_ns != &init_user_ns) {
>
>         Here we should just set: unpriv = true;
>
> > -                     tbl[0].procname = NULL;
> > -                     ctl_table_size = 0;
> > +                     for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> > +                             if (!tbl[idx].procname)
> > +                                     continue;
> > +
> > +                             if (!((strcmp(tbl[idx].procname, "conntrack") == 0) ||
> > +                                   (strcmp(tbl[idx].procname, "conn_reuse_mode") == 0) ||
> > +                                   (strcmp(tbl[idx].procname, "expire_nodest_conn") == 0) ||
> > +                                   (strcmp(tbl[idx].procname, "expire_quiescent_template") == 0)))
> > +                                     tbl[idx].mode = 0444;
> > +                     }
> >               }
> >       } else
> >               tbl = vs_vars;
>
>         And below at every place to use:
>
>         if (unpriv)
>                 tbl[idx].mode = 0444;
>
>         for the following 4 privileged sysctl vars:
>
> - sync_qlen_max:
>         - allocates messages in kernel context
>         - this needs better tunning in another patch
>
> - sync_sock_size:
>         - allocates messages in kernel context
>
> - run_estimation:
>         - for now, better init ns to decide if to use est stats
>
> - est_nice:
>         - for now, better init ns to decide the value
>
> - debug_level:
>         - already set to 0444
>
>         I.e. these vars allocate resources (mem, CPU) without
> proper control, so for now we will just copy them from init ns
> without allowing writing. And they are vars that are not tuned
> often. Also we do not know which netns is supposed to be the
> privileged one, some solutions move all devices out of init_net,
> so we can not decide where to use lower limits.

I agree. I have also decided to forbid "est_cpulist" for unprivileged users.

>
>         OTOH, "amemthresh" is not privileged but needs single READ_ONCE
> for sysctl_amemthresh in update_defense_level() due to the possible
> div by zero if we allow writing to anyone, eg.:
>
>         int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
>         ...
>         nomem = availmem < amemthresh;
>         ... use only amemthresh
>
>         All other vars can be writable.

Have fixed this and sent it as a separate patch! ;-)

Thanks a lot for such a quick review!

Kind regards,
Alex

>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
diff mbox series

Patch

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..92a818c2f783 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4285,10 +4285,22 @@  static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 		if (tbl == NULL)
 			return -ENOMEM;
 
-		/* Don't export sysctls to unprivileged users */
+		/* Let's show all sysctls in non-init user namespace-owned
+		 * net namespaces, but make them read-only.
+		 *
+		 * Allow only a few specific sysctls to be writable.
+		 */
 		if (net->user_ns != &init_user_ns) {
-			tbl[0].procname = NULL;
-			ctl_table_size = 0;
+			for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
+				if (!tbl[idx].procname)
+					continue;
+
+				if (!((strcmp(tbl[idx].procname, "conntrack") == 0) ||
+				      (strcmp(tbl[idx].procname, "conn_reuse_mode") == 0) ||
+				      (strcmp(tbl[idx].procname, "expire_nodest_conn") == 0) ||
+				      (strcmp(tbl[idx].procname, "expire_quiescent_template") == 0)))
+					tbl[idx].mode = 0444;
+			}
 		}
 	} else
 		tbl = vs_vars;