diff mbox series

[v3] kmod: harden user namespaces with new kernel.ns_modules_allowed sysctl

Message ID 20230112131911.7684-1-vegard.nossum@oracle.com (mailing list archive)
State New, archived
Headers show
Series [v3] kmod: harden user namespaces with new kernel.ns_modules_allowed sysctl | expand

Commit Message

Vegard Nossum Jan. 12, 2023, 1:19 p.m. UTC
Creating a new user namespace grants you the ability to reach a lot of code
(including loading certain kernel modules) that would otherwise be out of
reach of an attacker. We can reduce the attack surface and block exploits
by ensuring that user namespaces cannot trigger module (auto-)loading.

A cursory search of exploits found online yields the following
non-exhaustive list of vulnerabilities, and shows that the technique is
both old and still in use:

- CVE-2016-8655
- CVE-2017-1000112
- CVE-2021-32606
- CVE-2022-2588
- CVE-2022-27666
- CVE-2022-34918
- CVE-2023-0179

A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS
Stream, and Oracle Linux allow unprivileged user namespaces by default,
probably to support sandboxing in browsers and containers. Major
exceptions would be Debian and Arch Linux which carry an out-of-tree patch
to disable user namespaces by default.

This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
0 will block requests to load modules when the request originates in a
process running in a user namespace.

For backwards compatibility, the default value of the sysctl is set to
CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
there should be absolutely no change in behaviour unless you opt in either
at compile time or at runtime.

This mitigation obviously offers no protection if the vulnerable module is
already loaded, but for many of these exploits the vast majority of users
will never actually load or use these modules on purpose; in other words,
for the vast majority of users, this would block exploits for the above
list of vulnerabilities.

Testing: Running the reproducer for CVE-2022-2588 fails and results in the
following message in the kernel log:

    [  130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl

v2:
- fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init()
- use .maxlen = sizeof(int) for proc_dobool()
- don't warn when sysctl_ns_modules_allowed == 1

v3:
- drop capable(CAP_SYS_MODULE) check
- add a new CVE to changelog :-)
- add survey of distros that enable unpriv userns to changelog

Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1
Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2
Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-hardening@vger.kernel.org
Cc: linux-modules@vger.kernel.org
Cc: John Haxby <john.haxby@oracle.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++
 init/Kconfig                                | 17 ++++++++++
 kernel/kmod.c                               | 37 +++++++++++++++++++++
 3 files changed, 65 insertions(+)

Comments

Serge E. Hallyn Jan. 12, 2023, 6 p.m. UTC | #1
On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
> Creating a new user namespace grants you the ability to reach a lot of code
> (including loading certain kernel modules) that would otherwise be out of
> reach of an attacker. We can reduce the attack surface and block exploits
> by ensuring that user namespaces cannot trigger module (auto-)loading.
> 
> A cursory search of exploits found online yields the following
> non-exhaustive list of vulnerabilities, and shows that the technique is
> both old and still in use:
> 
> - CVE-2016-8655
> - CVE-2017-1000112
> - CVE-2021-32606
> - CVE-2022-2588
> - CVE-2022-27666
> - CVE-2022-34918
> - CVE-2023-0179

I think it would be worth pointing out how many of the above would
actually be aided by this patch.  The first two would not, but certainly
at least the can module one counts.  So I support this at least in
principle.  I'll take a closer look at the code hopefully tonight.

> A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS
> Stream, and Oracle Linux allow unprivileged user namespaces by default,
> probably to support sandboxing in browsers and containers. Major
> exceptions would be Debian and Arch Linux which carry an out-of-tree patch
> to disable user namespaces by default.
> 
> This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
> 0 will block requests to load modules when the request originates in a
> process running in a user namespace.
> 
> For backwards compatibility, the default value of the sysctl is set to
> CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
> there should be absolutely no change in behaviour unless you opt in either
> at compile time or at runtime.
> 
> This mitigation obviously offers no protection if the vulnerable module is
> already loaded, but for many of these exploits the vast majority of users
> will never actually load or use these modules on purpose; in other words,
> for the vast majority of users, this would block exploits for the above
> list of vulnerabilities.
> 
> Testing: Running the reproducer for CVE-2022-2588 fails and results in the
> following message in the kernel log:
> 
>     [  130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl
> 
> v2:
> - fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init()
> - use .maxlen = sizeof(int) for proc_dobool()
> - don't warn when sysctl_ns_modules_allowed == 1
> 
> v3:
> - drop capable(CAP_SYS_MODULE) check
> - add a new CVE to changelog :-)
> - add survey of distros that enable unpriv userns to changelog
> 
> Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1
> Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2
> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-hardening@vger.kernel.org
> Cc: linux-modules@vger.kernel.org
> Cc: John Haxby <john.haxby@oracle.com>
> Cc: Jann Horn <jannh@google.com>
> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++
>  init/Kconfig                                | 17 ++++++++++
>  kernel/kmod.c                               | 37 +++++++++++++++++++++
>  3 files changed, 65 insertions(+)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 46e3d62c0eea..bd9b4e911a6a 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -612,6 +612,17 @@ A value of 0 means no change. The default value is 200 meaning the NMI
>  watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
>  
>  
> +ns_modules_allowed
> +==================
> +
> +Control whether processes may trigger module loading inside a user namespace.
> +
> += =================================
> +0 Deny module loading requests.
> +1 Accept module loading requests.
> += =================================
> +
> +
>  numa_balancing
>  ==============
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index 7e5c3ddc341d..5d9ab43a24b9 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1249,6 +1249,23 @@ config USER_NS
>  
>  	  If unsure, say N.
>  
> +config NS_MODULES_ALLOWED_DEFAULT_ON
> +	bool "Allow user namespaces to auto-load kernel modules by default"
> +	depends on MODULES
> +	depends on USER_NS
> +	default y
> +	help
> +	  This option makes it so that processes running inside user
> +	  namespaces may auto-load kernel modules.
> +
> +	  Say N to mitigate some exploits that rely on being able to
> +	  auto-load kernel modules; however, this may also cause some
> +	  legitimate programs to fail unless kernel modules are loaded by
> +	  hand.
> +
> +	  You can write 0 or 1 to /proc/sys/kernel/ns_modules_allowed to
> +	  change behaviour at run-time.
> +
>  config PID_NS
>  	bool "PID Namespaces"
>  	default y
> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index b717134ebe17..938c0a39381a 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -25,6 +25,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/async.h>
>  #include <linux/uaccess.h>
> +#include <linux/sysctl.h>
>  
>  #include <trace/events/module.h>
>  
> @@ -105,6 +106,12 @@ static int call_modprobe(char *module_name, int wait)
>  	return -ENOMEM;
>  }
>  
> +/*
> + * Allow processes running inside namespaces to trigger module loading?
> + */
> +static bool sysctl_ns_modules_allowed __read_mostly =
> +	IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON);
> +
>  /**
>   * __request_module - try to load a kernel module
>   * @wait: wait (or not) for the operation to complete
> @@ -148,6 +155,16 @@ int __request_module(bool wait, const char *fmt, ...)
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * Disallow module loading if we're in a user namespace.
> +	 */
> +	if (current_user_ns() != &init_user_ns &&
> +	    !sysctl_ns_modules_allowed) {
> +		pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
> +			task_pid_nr(current), current->comm, module_name);
> +		return -EPERM;
> +	}
> +
>  	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
>  		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
>  				    atomic_read(&kmod_concurrent_max),
> @@ -175,3 +192,23 @@ int __request_module(bool wait, const char *fmt, ...)
>  	return ret;
>  }
>  EXPORT_SYMBOL(__request_module);
> +
> +#ifdef CONFIG_SYSCTL
> +static struct ctl_table kmod_sysctl_table[] = {
> +	{
> +		.procname       = "ns_modules_allowed",
> +		.data           = &sysctl_ns_modules_allowed,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dobool,
> +	},
> +	{ }
> +};
> +
> +static int __init kmod_sysctl_init(void)
> +{
> +	register_sysctl_init("kernel", kmod_sysctl_table);
> +	return 0;
> +}
> +late_initcall(kmod_sysctl_init);
> +#endif
> -- 
> 2.35.1.46.g38062e73e0
Luis Chamberlain Jan. 12, 2023, 6:45 p.m. UTC | #2
On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
> Creating a new user namespace grants you the ability to reach a lot of code
> (including loading certain kernel modules) that would otherwise be out of
> reach of an attacker. We can reduce the attack surface and block exploits
> by ensuring that user namespaces cannot trigger module (auto-)loading.
> 
> A cursory search of exploits found online yields the following
> non-exhaustive list of vulnerabilities, and shows that the technique is
> both old and still in use:
> 
> - CVE-2016-8655
> - CVE-2017-1000112
> - CVE-2021-32606
> - CVE-2022-2588
> - CVE-2022-27666
> - CVE-2022-34918
> - CVE-2023-0179
> 
> A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS
> Stream, and Oracle Linux allow unprivileged user namespaces by default,
> probably to support sandboxing in browsers and containers. Major
> exceptions would be Debian and Arch Linux which carry an out-of-tree patch
> to disable user namespaces by default.
> 
> This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
> 0 will block requests to load modules when the request originates in a
> process running in a user namespace.
> 
> For backwards compatibility, the default value of the sysctl is set to
> CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
> there should be absolutely no change in behaviour unless you opt in either
> at compile time or at runtime.
> 
> This mitigation obviously offers no protection if the vulnerable module is
> already loaded, but for many of these exploits the vast majority of users
> will never actually load or use these modules on purpose; in other words,
> for the vast majority of users, this would block exploits for the above
> list of vulnerabilities.
> 
> Testing: Running the reproducer for CVE-2022-2588 fails and results in the
> following message in the kernel log:
> 
>     [  130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl
> 
> v2:
> - fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init()
> - use .maxlen = sizeof(int) for proc_dobool()
> - don't warn when sysctl_ns_modules_allowed == 1
> 
> v3:
> - drop capable(CAP_SYS_MODULE) check
> - add a new CVE to changelog :-)
> - add survey of distros that enable unpriv userns to changelog
> 
> Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1
> Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2
> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-hardening@vger.kernel.org
> Cc: linux-modules@vger.kernel.org
> Cc: John Haxby <john.haxby@oracle.com>
> Cc: Jann Horn <jannh@google.com>
> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++
>  init/Kconfig                                | 17 ++++++++++
>  kernel/kmod.c                               | 37 +++++++++++++++++++++
>  3 files changed, 65 insertions(+)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 46e3d62c0eea..bd9b4e911a6a 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -612,6 +612,17 @@ A value of 0 means no change. The default value is 200 meaning the NMI
>  watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
>  
>  
> +ns_modules_allowed
> +==================
> +
> +Control whether processes may trigger module loading inside a user namespace.

This is false documentation. The place it is trying to protect simply
prevents trying to call modprobe for auto-loading within the kernel.

> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index b717134ebe17..938c0a39381a 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -25,6 +25,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/async.h>
>  #include <linux/uaccess.h>
> +#include <linux/sysctl.h>
>  
>  #include <trace/events/module.h>
>  
> @@ -105,6 +106,12 @@ static int call_modprobe(char *module_name, int wait)
>  	return -ENOMEM;
>  }
>  
> +/*
> + * Allow processes running inside namespaces to trigger module loading?
> + */
> +static bool sysctl_ns_modules_allowed __read_mostly =
> +	IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON);
> +
>  /**
>   * __request_module - try to load a kernel module
>   * @wait: wait (or not) for the operation to complete
> @@ -148,6 +155,16 @@ int __request_module(bool wait, const char *fmt, ...)
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * Disallow module loading if we're in a user namespace.
> +	 */
> +	if (current_user_ns() != &init_user_ns &&
> +	    !sysctl_ns_modules_allowed) {
> +		pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
> +			task_pid_nr(current), current->comm, module_name);
> +		return -EPERM;
> +	}
> +
>  	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
>  		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
>  				    atomic_read(&kmod_concurrent_max),

Have you seen what call_modprobe() does?

This is just a limitting the auto-loading through calling modprobe.
If the concern is to load modules wouldn't you be better off just
putting a stop gap at finit_module() which actually receives the
load attempt from modprobe? Ie, an evil namespace, if it has access
to /sbin/modprobe could simply just try calling /sbin/modprobe on its
own.

Beating the royal shit out of kmod is already stress tested via
tools/testing/selftests/kmod/kmod.sh in particular:

tools/testing/selftests/kmod/kmod.sh -t 0008
tools/testing/selftests/kmod/kmod.sh -t 0009

What this *could* do is race to force a failure on some other *real*
modprobe request we do wish to honor when the above kmod kmod_concurrent_max
is triggered.

So in terms of justification, this commit log needs a bit more work as I
just can't see how this alone is fixing any CVE.

Note: actually *racing* tons of requests at the same time could end
up then *in kernel requests* being processed, and there lies another
shit show which we've recently been nose diving into through the work
and report by Petr Pavlu [0]. Although I have that patch in modules-next,
it was too soon to send to Linus for v6.2-rc1 due to lack of testing on
linux-next. You may want to test that too, as its an old regression.
Even though the commit log describe honest requests for modules, surely
an exploit bad namespace could in theory try to race other requests too
to prefer its own frequency module, etc.

So let's take a step back and think this through. What exaclty and why
would this commit fix *any* security issue? Itemizing CVEs won't cut it.

[0] https://lkml.kernel.org/r/20221205103557.18363-1-petr.pavlu@suse.com

  Luis
Vegard Nossum Jan. 12, 2023, 9:53 p.m. UTC | #3
On 1/12/23 19:45, Luis Chamberlain wrote:
> On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
>>   
>> +ns_modules_allowed
>> +==================
>> +
>> +Control whether processes may trigger module loading inside a user namespace.
> 
> This is false documentation. The place it is trying to protect simply
> prevents trying to call modprobe for auto-loading within the kernel.

I don't think this is false -- but yes, this only protects module
auto-loading in user namespaces; all auto-loading calls within the
kernel should be going through this __request_module() -> modprobe path.

init_module()/finit_module(), the mechanism used by modprobe, are
themselves already restricted inside user namespaces, see below.

>> +	/*
>> +	 * Disallow module loading if we're in a user namespace.
>> +	 */
>> +	if (current_user_ns() != &init_user_ns &&
>> +	    !sysctl_ns_modules_allowed) {
>> +		pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
>> +			task_pid_nr(current), current->comm, module_name);
>> +		return -EPERM;
>> +	}
>> +
>>   	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
>>   		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
>>   				    atomic_read(&kmod_concurrent_max),
> 
> Have you seen what call_modprobe() does?

Yes.

> This is just a limitting the auto-loading through calling modprobe.
> If the concern is to load modules wouldn't you be better off just
> putting a stop gap at finit_module() which actually receives the
> load attempt from modprobe? Ie, an evil namespace, if it has access
> to /sbin/modprobe could simply just try calling /sbin/modprobe on its
> own.

No.

Root inside a user namespace can't call finit_module() as it won't have
the necessary capabilities in the init namespace, see may_init_module().

modprobe, on the other hand, when called by the kernel, is called
through usermode helper, which runs in the init namespace as root, so it
can do whatever it wants.

If modprobe called by root inside a user namespace could load anything,
that itself would be a security issue. But it can't, so it's not.

> Beating the royal shit out of kmod is already stress tested via
> tools/testing/selftests/kmod/kmod.sh in particular:
> 
> tools/testing/selftests/kmod/kmod.sh -t 0008
> tools/testing/selftests/kmod/kmod.sh -t 0009
> 
> What this *could* do is race to force a failure on some other *real*
> modprobe request we do wish to honor when the above kmod kmod_concurrent_max
> is triggered.

How? My new check is done before the kmod_concurrent_max check/critical
section... the check doesn't cause any more modprobe requests to happen
in the first place, the only thing it can do is make them exit early.
There is no way my patch can make this worse.

> So in terms of justification, this commit log needs a bit more work as I
> just can't see how this alone is fixing any CVE.

[...]

> So let's take a step back and think this through. What exaclty and why
> would this commit fix *any* security issue? Itemizing CVEs won't cut it.

I can include my explanations above in the changelog if you think that
will help.


Vegard
Luis Chamberlain Jan. 13, 2023, 11:06 p.m. UTC | #4
On Thu, Jan 12, 2023 at 10:53:07PM +0100, Vegard Nossum wrote:
> On 1/12/23 19:45, Luis Chamberlain wrote:
> > On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
> > > +ns_modules_allowed
> > > +==================
> > > +
> > > +Control whether processes may trigger module loading inside a user namespace.
> > 
> > This is false documentation. The place it is trying to protect simply
> > prevents trying to call modprobe for auto-loading within the kernel.
> 
> I don't think this is false -- but yes, this only protects module
> auto-loading in user namespaces; all auto-loading calls within the
> kernel should be going through this __request_module() -> modprobe path.
> 
> init_module()/finit_module(), the mechanism used by modprobe, are
> themselves already restricted inside user namespaces, see below.

The documentation should be clear about the exact nature of what
mechanism is prevented to load.

> > > +	/*
> > > +	 * Disallow module loading if we're in a user namespace.
> > > +	 */
> > > +	if (current_user_ns() != &init_user_ns &&
> > > +	    !sysctl_ns_modules_allowed) {
> > > +		pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
> > > +			task_pid_nr(current), current->comm, module_name);
> > > +		return -EPERM;
> > > +	}
> > > +
> > >   	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
> > >   		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
> > >   				    atomic_read(&kmod_concurrent_max),
> > 
> > Have you seen what call_modprobe() does?
> 
> Yes.
> 
> > This is just a limitting the auto-loading through calling modprobe.
> > If the concern is to load modules wouldn't you be better off just
> > putting a stop gap at finit_module() which actually receives the
> > load attempt from modprobe? Ie, an evil namespace, if it has access
> > to /sbin/modprobe could simply just try calling /sbin/modprobe on its
> > own.
> 
> No.
> 
> Root inside a user namespace can't call finit_module() as it won't have
> the necessary capabilities in the init namespace, see may_init_module().

I think the documentation to this knob you are adding should explain
this as well to give a bette context as to why it is useful.

And if may_init_module() suffices, why not just add a may_init_module()
instead of your check? Why would we allow a successful call to modprobe
with your sysctl but end up having it ignored in the end by
finit_module()? If what is being allowed here is to overcome may_init_module()
CAP_SYS_MODULE check by using call_modprobe() on the user namespace
the commit log should just mention this, and mention that by design
user namespaces have never been intended to allow loading modules, even if
they somehow end up triggering auto-loading of modules via request_module().

If this is true then currently we are enabling auto-loading as a
mistake, and the sysctl is only valuable to prevent regressions with
existing behaviour which should have been disabled from the start.

Is there an example successful exploit which takes avantage of having
a user namespace auto-load a kernel module and that triggers a security
flaw? What's an example of one?

> > What this *could* do is race to force a failure on some other *real*
> > modprobe request we do wish to honor when the above kmod kmod_concurrent_max
> > is triggered.
> 
> How? My new check is done before the kmod_concurrent_max check/critical
> section... the check doesn't cause any more modprobe requests to happen
> in the first place, the only thing it can do is make them exit early.
> There is no way my patch can make this worse.

You are missing my point. My point is that abuse in general over having
a user namespace calling modprobe could cause harm that it can possibly
Denial-of-Service valid requests.

If user namespaces should never have been allowed to even auto-load the
commit log should mention that.

> > So in terms of justification, this commit log needs a bit more work as I
> > just can't see how this alone is fixing any CVE.
> 
> [...]
> 
> > So let's take a step back and think this through. What exaclty and why
> > would this commit fix *any* security issue? Itemizing CVEs won't cut it.
> 
> I can include my explanations above in the changelog if you think that
> will help.

Answering the above questions would help.

  Luis
Vegard Nossum March 1, 2023, 11:09 a.m. UTC | #5
On 1/12/23 19:00, Serge E. Hallyn wrote:
> On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
>> Creating a new user namespace grants you the ability to reach a lot of code
>> (including loading certain kernel modules) that would otherwise be out of
>> reach of an attacker. We can reduce the attack surface and block exploits
>> by ensuring that user namespaces cannot trigger module (auto-)loading.
>>
>> A cursory search of exploits found online yields the following
>> non-exhaustive list of vulnerabilities, and shows that the technique is
>> both old and still in use:
>>
>> - CVE-2016-8655
>> - CVE-2017-1000112
>> - CVE-2021-32606
>> - CVE-2022-2588
>> - CVE-2022-27666
>> - CVE-2022-34918
>> - CVE-2023-0179
> 
> I think it would be worth pointing out how many of the above would
> actually be aided by this patch.  The first two would not, but certainly
> at least the can module one counts.  So I support this at least in
> principle.  I'll take a closer look at the code hopefully tonight.

The intention was to list _only_ CVEs with exploits that would be
mitigated by this patch. Let me go through them one by one, just using
public exploits found with Google:

CVE-2016-8655: this uses AF_PACKET. I guess your objection is that
AF_PACKET is rarely built as a module and even then would most certainly
already be loaded as part of regular operations? I see at least one
distro kernel having used CONFIG_PACKET=m in the past, so I wouldn't
write this off completely. You need CAP_NET_RAW to create this socket
type AFAICT, which is why the exploit uses user/network namespaces.

CVE-2017-1000112: uses AF_INET. Agreed that this would certainly be
compiled in or already loaded, so we can drop this.

CVE-2021-32606: needs CONFIG_CAN_ISOTP=m/can-isotp.

CVE-2022-2588: needs CONFIG_NET_CLS_ROUTE4=m/cls_route.

CVE-2022-27666: needs CONFIG_INET6_ESP=m/esp6.

CVE-2022-34918: needs CONFIG_NF_TABLES=m/nf_tables at least.

CVE-2023-0179: needs CONFIG_NF_TABLES=m/nf_tables at least.

All the exploits seem to be using user namespaces, for CVE-2017-1000112
I think it needs it to set the MTU of a dummy interface. I'm happy to
drop this CVE from the list (I probably looked too fast when looking at
it), but I think the rest are legitimate. Added Andrey Konovalov to Cc
as he wrote the exploit I looked at and can maybe confirm (and in
general has more experience with exploits).

Thanks again for looking at this -- I will try to address Luis's
comments in this thread as well and send out a v4 when we agree on the
required changes.


Vegard
Vegard Nossum March 1, 2023, 12:37 p.m. UTC | #6
On 3/1/23 12:09, Vegard Nossum wrote:
> 
> On 1/12/23 19:00, Serge E. Hallyn wrote:
>> On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote:
>>> Creating a new user namespace grants you the ability to reach a lot 
>>> of code
>>> (including loading certain kernel modules) that would otherwise be 
>>> out of
>>> reach of an attacker. We can reduce the attack surface and block 
>>> exploits
>>> by ensuring that user namespaces cannot trigger module (auto-)loading.
>>>
>>> A cursory search of exploits found online yields the following
>>> non-exhaustive list of vulnerabilities, and shows that the technique is
>>> both old and still in use:
>>>
>>> - CVE-2016-8655
>>> - CVE-2017-1000112
>>> - CVE-2021-32606
>>> - CVE-2022-2588
>>> - CVE-2022-27666
>>> - CVE-2022-34918
>>> - CVE-2023-0179
>>
>> I think it would be worth pointing out how many of the above would
>> actually be aided by this patch.  The first two would not, but certainly
>> at least the can module one counts.  So I support this at least in
>> principle.  I'll take a closer look at the code hopefully tonight.
> 
> The intention was to list _only_ CVEs with exploits that would be
> mitigated by this patch. Let me go through them one by one, just using
> public exploits found with Google:
> 
> CVE-2016-8655: this uses AF_PACKET. I guess your objection is that
> AF_PACKET is rarely built as a module and even then would most certainly
> already be loaded as part of regular operations? I see at least one
> distro kernel having used CONFIG_PACKET=m in the past, so I wouldn't
> write this off completely. You need CAP_NET_RAW to create this socket
> type AFAICT, which is why the exploit uses user/network namespaces.
> 
> CVE-2017-1000112: uses AF_INET. Agreed that this would certainly be
> compiled in or already loaded, so we can drop this.

[...]

> All the exploits seem to be using user namespaces, for CVE-2017-1000112
> I think it needs it to set the MTU of a dummy interface. I'm happy to
> drop this CVE from the list (I probably looked too fast when looking at
> it), but I think the rest are legitimate. Added Andrey Konovalov to Cc
> as he wrote the exploit I looked at and can maybe confirm (and in
> general has more experience with exploits).

Oh, I see -- the request_module() call in socket() happens before the
CAP_NET_RAW check in e.g. packet_create(), which means you don't need
user namespaces to _load_ these protocols, you can just do the socket()
call in the init user namespace, which will load the module but fail
creating the socket, then call socket() again inside the user namespace,
which now succeeds because the module has been loaded...

I haven't looked at the other CVEs from this angle, I'll look over them
again and see what I can find.


Vegard
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 46e3d62c0eea..bd9b4e911a6a 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -612,6 +612,17 @@  A value of 0 means no change. The default value is 200 meaning the NMI
 watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
 
 
+ns_modules_allowed
+==================
+
+Control whether processes may trigger module loading inside a user namespace.
+
+= =================================
+0 Deny module loading requests.
+1 Accept module loading requests.
+= =================================
+
+
 numa_balancing
 ==============
 
diff --git a/init/Kconfig b/init/Kconfig
index 7e5c3ddc341d..5d9ab43a24b9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1249,6 +1249,23 @@  config USER_NS
 
 	  If unsure, say N.
 
+config NS_MODULES_ALLOWED_DEFAULT_ON
+	bool "Allow user namespaces to auto-load kernel modules by default"
+	depends on MODULES
+	depends on USER_NS
+	default y
+	help
+	  This option makes it so that processes running inside user
+	  namespaces may auto-load kernel modules.
+
+	  Say N to mitigate some exploits that rely on being able to
+	  auto-load kernel modules; however, this may also cause some
+	  legitimate programs to fail unless kernel modules are loaded by
+	  hand.
+
+	  You can write 0 or 1 to /proc/sys/kernel/ns_modules_allowed to
+	  change behaviour at run-time.
+
 config PID_NS
 	bool "PID Namespaces"
 	default y
diff --git a/kernel/kmod.c b/kernel/kmod.c
index b717134ebe17..938c0a39381a 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -25,6 +25,7 @@ 
 #include <linux/ptrace.h>
 #include <linux/async.h>
 #include <linux/uaccess.h>
+#include <linux/sysctl.h>
 
 #include <trace/events/module.h>
 
@@ -105,6 +106,12 @@  static int call_modprobe(char *module_name, int wait)
 	return -ENOMEM;
 }
 
+/*
+ * Allow processes running inside namespaces to trigger module loading?
+ */
+static bool sysctl_ns_modules_allowed __read_mostly =
+	IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON);
+
 /**
  * __request_module - try to load a kernel module
  * @wait: wait (or not) for the operation to complete
@@ -148,6 +155,16 @@  int __request_module(bool wait, const char *fmt, ...)
 	if (ret)
 		return ret;
 
+	/*
+	 * Disallow module loading if we're in a user namespace.
+	 */
+	if (current_user_ns() != &init_user_ns &&
+	    !sysctl_ns_modules_allowed) {
+		pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
+			task_pid_nr(current), current->comm, module_name);
+		return -EPERM;
+	}
+
 	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
 		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
 				    atomic_read(&kmod_concurrent_max),
@@ -175,3 +192,23 @@  int __request_module(bool wait, const char *fmt, ...)
 	return ret;
 }
 EXPORT_SYMBOL(__request_module);
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table kmod_sysctl_table[] = {
+	{
+		.procname       = "ns_modules_allowed",
+		.data           = &sysctl_ns_modules_allowed,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dobool,
+	},
+	{ }
+};
+
+static int __init kmod_sysctl_init(void)
+{
+	register_sysctl_init("kernel", kmod_sysctl_table);
+	return 0;
+}
+late_initcall(kmod_sysctl_init);
+#endif