capabilities: new kernel.ns_modules_allowed sysctl

Message ID: 20220809185229.28417-1-vegard.nossum@oracle.com
State: Superseded
From: Vegard Nossum <vegard.nossum@oracle.com>
To: linux-kernel@vger.kernel.org
Cc: Vegard Nossum <vegard.nossum@oracle.com>,
    Thadeu Lima de Souza Cascardo <cascardo@canonical.com>,
    Serge Hallyn <serge@hallyn.com>,
    Eric Biederman <ebiederm@xmission.com>,
    Kees Cook <keescook@chromium.org>,
    linux-hardening@vger.kernel.org,
    John Haxby <john.haxby@oracle.com>
Subject: [PATCH] capabilities: new kernel.ns_modules_allowed sysctl
Date: Tue, 9 Aug 2022 20:52:29 +0200

Vegard Nossum Aug. 9, 2022, 6:52 p.m. UTC

Creating a new user namespace grants you the ability to reach a lot of code
(including loading certain kernel modules) that would otherwise be out of
reach of an attacker. We can reduce the attack surface and block exploits
by ensuring that user namespaces cannot trigger module (auto-)loading.

A cursory search of exploits found online yields the following extremely
non-exhaustive list of vulnerabilities, and shows that the technique is
both old and still in use:

- CVE-2016-8655
- CVE-2017-1000112
- CVE-2021-32606
- CVE-2022-2588
- CVE-2022-27666
- CVE-2022-34918

This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
0 will block requests to load modules when the request originates in a
process running in a user namespace.

For backwards compatibility, the default value of the sysctl is set to
CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
there should be absolutely no change in behaviour unless you opt in either
at compile time or at runtime.

This mitigation obviously offers no protection if the vulnerable module is
already loaded, but for many of these exploits the vast majority of users
will never actually load or use these modules on purpose; in other words,
for the vast majority of users, this would block exploits for the above
list of vulnerabilities.

Testing: Running the reproducer for CVE-2022-2588 fails and results in the
following message in the kernel log:

    [  130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl

Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-hardening@vger.kernel.org
Cc: John Haxby <john.haxby@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++
 init/Kconfig                                | 17 +++++++++
 kernel/kmod.c                               | 39 +++++++++++++++++++++
 3 files changed, 67 insertions(+)

Kees Cook Aug. 9, 2022, 10:56 p.m. UTC | #1

On Tue, Aug 09, 2022 at 08:52:29PM +0200, Vegard Nossum wrote:
> Creating a new user namespace grants you the ability to reach a lot of code
> (including loading certain kernel modules) that would otherwise be out of
> reach of an attacker. We can reduce the attack surface and block exploits
> by ensuring that user namespaces cannot trigger module (auto-)loading.
> 
> A cursory search of exploits found online yields the following extremely
> non-exhaustive list of vulnerabilities, and shows that the technique is
> both old and still in use:
> 
> - CVE-2016-8655
> - CVE-2017-1000112
> - CVE-2021-32606
> - CVE-2022-2588
> - CVE-2022-27666
> - CVE-2022-34918
> 
> This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
> 0 will block requests to load modules when the request originates in a
> process running in a user namespace.
> 
> For backwards compatibility, the default value of the sysctl is set to
> CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
> there should be absolutely no change in behaviour unless you opt in either
> at compile time or at runtime.
> 
> This mitigation obviously offers no protection if the vulnerable module is
> already loaded, but for many of these exploits the vast majority of users
> will never actually load or use these modules on purpose; in other words,
> for the vast majority of users, this would block exploits for the above
> list of vulnerabilities.

We've needed better module autoloading protections for a long time[1].
This patch is a big hammer ("all user namespaces"), so I worry it
wouldn't actually get used much.

Here's a pointer into a prior thread, where Linus chimed in[2].
I replied back then, but I'm not sure I agree with my 2017 self any
more. :P

It really does feel like the loading decisions need to be made by the
userspace helper, which currently doesn't have enough information to
make those choices.

-Kees

[1] https://github.com/KSPP/linux/issues/24
[2] https://lore.kernel.org/kernel-hardening/CA+55aFxiDKfe6VCM+aV2OgnkzMpP+iz+rn2k25_Qa_QLex=pPQ@mail.gmail.com/

> 
> Testing: Running the reproducer for CVE-2022-2588 fails and results in the
> following message in the kernel log:
> 
>     [  130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl
> 
> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-hardening@vger.kernel.org
> Cc: John Haxby <john.haxby@oracle.com>
> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++
>  init/Kconfig                                | 17 +++++++++
>  kernel/kmod.c                               | 39 +++++++++++++++++++++
>  3 files changed, 67 insertions(+)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index ddccd10774623..551de7bce836c 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -592,6 +592,17 @@ to the guest kernel command line (see
>  Documentation/admin-guide/kernel-parameters.rst).
>  
>  
> +ns_modules_allowed
> +==================
> +
> +Control whether processes may trigger module loading inside a user namespace.
> +
> += =================================
> +0 Deny module loading requests.
> +1 Accept module loading requests.
> += =================================
> +
> +
>  numa_balancing
>  ==============
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index c984afc489dea..6734373995936 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1226,6 +1226,23 @@ config USER_NS
>  
>  	  If unsure, say N.
>  
> +config NS_MODULES_ALLOWED_DEFAULT_ON
> +	bool "Allow user namespaces to auto-load kernel modules by default"
> +	depends on MODULES
> +	depends on USER_NS
> +	default y
> +	help
> +	  This option makes it so that processes running inside user
> +	  namespaces may auto-load kernel modules.
> +
> +	  Say N to mitigate some exploits that rely on being able to
> +	  auto-load kernel modules; however, this may also cause some
> +	  legitimate programs to fail unless kernel modules are loaded by
> +	  hand.
> +
> +	  You can write 0 or 1 to /proc/sys/kernel/ns_modules_allowed to
> +	  change behaviour at run-time.
> +
>  config PID_NS
>  	bool "PID Namespaces"
>  	default y
> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index b717134ebe170..53e26009410ef 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -105,6 +105,12 @@ static int call_modprobe(char *module_name, int wait)
>  	return -ENOMEM;
>  }
>  
> +/*
> + * Allow processes running inside namespaces to trigger module loading?
> + */
> +static bool sysctl_ns_modules_allowed __read_mostly =
> +	IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON);
> +
>  /**
>   * __request_module - try to load a kernel module
>   * @wait: wait (or not) for the operation to complete
> @@ -148,6 +154,21 @@ int __request_module(bool wait, const char *fmt, ...)
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * Disallow if we're in a user namespace and we don't have
> +	 * CAP_SYS_MODULE in the init namespace.
> +	 */
> +	if (current_user_ns() != &init_user_ns && !capable(CAP_SYS_MODULE)) {
> +		if (sysctl_ns_modules_allowed) {
> +			pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s\n",
> +				task_pid_nr(current), current->comm, module_name);
> +		} else {
> +			pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n",
> +				task_pid_nr(current), current->comm, module_name);
> +			return -EPERM;
> +		}
> +	}
> +
>  	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
>  		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
>  				    atomic_read(&kmod_concurrent_max),
> @@ -175,3 +196,21 @@ int __request_module(bool wait, const char *fmt, ...)
>  	return ret;
>  }
>  EXPORT_SYMBOL(__request_module);
> +
> +static struct ctl_table kmod_sysctl_table[] = {
> +	{
> +		.procname       = "ns_modules_allowed",
> +		.data           = &sysctl_ns_modules_allowed,
> +		.maxlen         = sizeof(sysctl_ns_modules_allowed),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dobool,
> +	},
> +	{ }
> +};
> +
> +static int __init kmod_sysctl_init(void)
> +{
> +	register_sysctl_init("kernel", kmod_sysctl_table);
> +	return 0;
> +}
> +late_initcall(kmod_sysctl_init);
> -- 
> 2.35.1.46.g38062e73e0
>

Vegard Nossum Aug. 10, 2022, 8:25 a.m. UTC | #2

On 8/10/22 00:56, Kees Cook wrote:
> On Tue, Aug 09, 2022 at 08:52:29PM +0200, Vegard Nossum wrote:
>> Creating a new user namespace grants you the ability to reach a lot of code
>> (including loading certain kernel modules) that would otherwise be out of
>> reach of an attacker. We can reduce the attack surface and block exploits
>> by ensuring that user namespaces cannot trigger module (auto-)loading.
>>
>> A cursory search of exploits found online yields the following extremely
>> non-exhaustive list of vulnerabilities, and shows that the technique is
>> both old and still in use:
>>
>> - CVE-2016-8655
>> - CVE-2017-1000112
>> - CVE-2021-32606
>> - CVE-2022-2588
>> - CVE-2022-27666
>> - CVE-2022-34918
>>
>> This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
>> 0 will block requests to load modules when the request originates in a
>> process running in a user namespace.
>>
>> For backwards compatibility, the default value of the sysctl is set to
>> CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
>> there should be absolutely no change in behaviour unless you opt in either
>> at compile time or at runtime.
>>
>> This mitigation obviously offers no protection if the vulnerable module is
>> already loaded, but for many of these exploits the vast majority of users
>> will never actually load or use these modules on purpose; in other words,
>> for the vast majority of users, this would block exploits for the above
>> list of vulnerabilities.
> 
> We've needed better module autoloading protections for a long time[1].
> This patch is a big hammer ("all user namespaces"), so I worry it
> wouldn't actually get used much.
> 
> Here's a pointer into a prior thread, where Linus chimed in[2].
> I replied back then, but I'm not sure I agree with my 2017 self any
> more. :P
> 
> It really does feel like the loading decisions need to be made by the
> userspace helper, which currently doesn't have enough information to
> make those choices.
> 
> -Kees
> 
> [1] https://github.com/KSPP/linux/issues/24
> [2] https://lore.kernel.org/kernel-hardening/CA+55aFxiDKfe6VCM+aV2OgnkzMpP+iz+rn2k25_Qa_QLex=pPQ@mail.gmail.com/

Thanks for the pointers, I didn't have any of this context.

I would still argue for my patch with the following points:

1) As you said, it's been almost 7 years since the discussion you linked
and apparently it's still a problem (including those 5 privilege
escalation CVEs from my changelog); this relatively simple patch
provides a mitigation _today_

2) it can be layered with any other future mitigations if they do show up

3) it's not as big a hammer as completely disabling unprivileged user
namespaces, which seems to be the next best thing currently in terms of
protecting your users (as a distro)

4) both the implementation and the user interface are fairly simple in
my patch, which means it's not a huge long term maintenance burden like
block-/allowlists or capabilities based on whether modules are
maintained or not (I would also argue that "maintained or not" is not a
great proxy for whether there are security issues in the code)

5) it resembles other sysctls like unprivileged_bpf_disabled or
perf_event_paranoid, or even modules_disabled

6) it's opt-in by default, and even then, if you run into problems with
containers that don't work or whatever, the solution is extremely
simple: just load the modules you need before starting your container
(the module names are printed in the kernel log so it shouldn't be
difficult to track down issues)

What's the downside..?


Vegard

kernel test robot Aug. 10, 2022, 9:54 p.m. UTC | #3

Hi Vegard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kees/for-next/pstore]
[also build test ERROR on v5.19]
[cannot apply to linus/master next-20220810]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vegard-Nossum/capabilities-new-kernel-ns_modules_allowed-sysctl/20220810-031142
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/pstore
config: arm64-buildonly-randconfig-r006-20220810 (https://download.01.org/0day-ci/archive/20220811/202208110524.fN5PNDSo-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/bd78b69455d4b3cac70812bf23a27de310e813cd
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Vegard-Nossum/capabilities-new-kernel-ns_modules_allowed-sysctl/20220810-031142
        git checkout bd78b69455d4b3cac70812bf23a27de310e813cd
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> kernel/kmod.c:213:2: error: call to undeclared function 'register_sysctl_init'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           register_sysctl_init("kernel", kmod_sysctl_table);
           ^
   1 error generated.


vim +/register_sysctl_init +213 kernel/kmod.c

   210	
   211	static int __init kmod_sysctl_init(void)
   212	{
 > 213		register_sysctl_init("kernel", kmod_sysctl_table);

kernel test robot Aug. 10, 2022, 11:35 p.m. UTC | #4

Hi Vegard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kees/for-next/pstore]
[also build test ERROR on v5.19]
[cannot apply to linus/master next-20220810]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vegard-Nossum/capabilities-new-kernel-ns_modules_allowed-sysctl/20220810-031142
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/pstore
config: alpha-randconfig-r032-20220810 (https://download.01.org/0day-ci/archive/20220811/202208110734.tH4Z51iL-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/bd78b69455d4b3cac70812bf23a27de310e813cd
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Vegard-Nossum/capabilities-new-kernel-ns_modules_allowed-sysctl/20220810-031142
        git checkout bd78b69455d4b3cac70812bf23a27de310e813cd
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/kmod.c: In function 'kmod_sysctl_init':
>> kernel/kmod.c:213:9: error: implicit declaration of function 'register_sysctl_init'; did you mean 'register_sysctl_base'? [-Werror=implicit-function-declaration]
     213 |         register_sysctl_init("kernel", kmod_sysctl_table);
         |         ^~~~~~~~~~~~~~~~~~~~
         |         register_sysctl_base
   cc1: some warnings being treated as errors


vim +213 kernel/kmod.c

   210	
   211	static int __init kmod_sysctl_init(void)
   212	{
 > 213		register_sysctl_init("kernel", kmod_sysctl_table);

Kees Cook Aug. 12, 2022, 6:48 p.m. UTC | #5

On Wed, Aug 10, 2022 at 10:25:17AM +0200, Vegard Nossum wrote:
> 
> On 8/10/22 00:56, Kees Cook wrote:
> > On Tue, Aug 09, 2022 at 08:52:29PM +0200, Vegard Nossum wrote:
> >> Creating a new user namespace grants you the ability to reach a lot of code
> >> (including loading certain kernel modules) that would otherwise be out of
> >> reach of an attacker. We can reduce the attack surface and block exploits
> >> by ensuring that user namespaces cannot trigger module (auto-)loading.
> >>
> >> A cursory search of exploits found online yields the following extremely
> >> non-exhaustive list of vulnerabilities, and shows that the technique is
> >> both old and still in use:
> >>
> >> - CVE-2016-8655
> >> - CVE-2017-1000112
> >> - CVE-2021-32606
> >> - CVE-2022-2588
> >> - CVE-2022-27666
> >> - CVE-2022-34918
> >>
> >> This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to
> >> 0 will block requests to load modules when the request originates in a
> >> process running in a user namespace.
> >>
> >> For backwards compatibility, the default value of the sysctl is set to
> >> CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning
> >> there should be absolutely no change in behaviour unless you opt in either
> >> at compile time or at runtime.
> >>
> >> This mitigation obviously offers no protection if the vulnerable module is
> >> already loaded, but for many of these exploits the vast majority of users
> >> will never actually load or use these modules on purpose; in other words,
> >> for the vast majority of users, this would block exploits for the above
> >> list of vulnerabilities.
> > 
> > We've needed better module autoloading protections for a long time[1].
> > This patch is a big hammer ("all user namespaces"), so I worry it
> > wouldn't actually get used much.
> > 
> > Here's a pointer into a prior thread, where Linus chimed in[2].
> > I replied back then, but I'm not sure I agree with my 2017 self any
> > more. :P
> > 
> > It really does feel like the loading decisions need to be made by the
> > userspace helper, which currently doesn't have enough information to
> > make those choices.
> > 
> > -Kees
> > 
> > [1] https://github.com/KSPP/linux/issues/24
> > [2] https://lore.kernel.org/kernel-hardening/CA+55aFxiDKfe6VCM+aV2OgnkzMpP+iz+rn2k25_Qa_QLex=pPQ@mail.gmail.com/
> 
> Thanks for the pointers, I didn't have any of this context.
> 
> I would still argue for my patch with the following points:
> 
> 1) As you said, it's been almost 7 years since the discussion you linked
> and apparently it's still a problem (including those 5 privilege
> escalation CVEs from my changelog); this relatively simple patch
> provides a mitigation _today_
> 
> 2) it can be layered with any other future mitigations if they do show up
> 
> 3) it's not as big a hammer as completely disabling unprivileged user
> namespaces, which seems to be the next best thing currently in terms of
> protecting your users (as a distro)
> 
> 4) both the implementation and the user interface are fairly simple in
> my patch, which means it's not a huge long term maintenance burden like
> block-/allowlists or capabilities based on whether modules are
> maintained or not (I would also argue that "maintained or not" is not a
> great proxy for whether there are security issues in the code)
> 
> 5) it resembles other sysctls like unprivileged_bpf_disabled or
> perf_event_paranoid, or even modules_disabled
> 
> 6) it's opt-in by default, and even then, if you run into problems with
> containers that don't work or whatever, the solution is extremely
> simple: just load the modules you need before starting your container
> (the module names are printed in the kernel log so it shouldn't be
> difficult to track down issues)
> 
> What's the downside..?

I agree, it'd be nice to have. I'm just trying to predict what kind of
push-back there may be.

Can you address the build failures noted on the thread, and send a v2? I
note that after this patch it looks like all module loading from a userns
gets logged, regardless of the setting. Is that intended?

-Kees

Vegard Nossum Aug. 15, 2022, 8:33 a.m. UTC | #6

On 8/12/22 20:48, Kees Cook wrote:
> On Wed, Aug 10, 2022 at 10:25:17AM +0200, Vegard Nossum wrote:
>>
>> On 8/10/22 00:56, Kees Cook wrote:
>>> On Tue, Aug 09, 2022 at 08:52:29PM +0200, Vegard Nossum wrote:
>>>> Creating a new user namespace grants you the ability to reach a lot of code
>>>> (including loading certain kernel modules) that would otherwise be out of
>>>> reach of an attacker. We can reduce the attack surface and block exploits
>>>> by ensuring that user namespaces cannot trigger module (auto-)loading.
>>>>

[...]

> I agree, it'd be nice to have. I'm just trying to predict what kind of
> push-back there may be.
> 
> Can you address the build failures noted on the thread, and send a v2?

Did just now:

https://lore.kernel.org/all/20220815082753.6088-1-vegard.nossum@oracle.com/

> I note that after this patch it looks like all module loading from a userns
> gets logged, regardless of the setting. Is that intended?

Yeah, I thought it was useful to know about these requests even when the
sysctl allows them, but I've removed that logging in v2 so the patch is less
intrusive. I guess it can always be added later if it actually serves a
purpose.

Thanks,


Vegard
