Message ID | 20230112131911.7684-1-vegard.nossum@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] kmod: harden user namespaces with new kernel.ns_modules_allowed sysctl | expand |
On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: > Creating a new user namespace grants you the ability to reach a lot of code > (including loading certain kernel modules) that would otherwise be out of > reach of an attacker. We can reduce the attack surface and block exploits > by ensuring that user namespaces cannot trigger module (auto-)loading. > > A cursory search of exploits found online yields the following > non-exhaustive list of vulnerabilities, and shows that the technique is > both old and still in use: > > - CVE-2016-8655 > - CVE-2017-1000112 > - CVE-2021-32606 > - CVE-2022-2588 > - CVE-2022-27666 > - CVE-2022-34918 > - CVE-2023-0179 I think it would be worth pointing out how many of the above would actually be aided by this patch. The first two would not, but certainly at least the can module one counts. So I support this at least in principle. I'll take a closer look at the code hopefully tonight. > A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS > Stream, and Oracle Linux allow unprivileged user namespaces by default, > probably to support sandboxing in browsers and containers. Major > exceptions would be Debian and Arch Linux which carry an out-of-tree patch > to disable user namespaces by default. > > This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to > 0 will block requests to load modules when the request originates in a > process running in a user namespace. > > For backwards compatibility, the default value of the sysctl is set to > CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning > there should be absolutely no change in behaviour unless you opt in either > at compile time or at runtime. > > This mitigation obviously offers no protection if the vulnerable module is > already loaded, but for many of these exploits the vast majority of users > will never actually load or use these modules on purpose; in other words, > for the vast majority of users, this would block exploits for the above > list of vulnerabilities. > > Testing: Running the reproducer for CVE-2022-2588 fails and results in the > following message in the kernel log: > > [ 130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl > > v2: > - fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init() > - use .maxlen = sizeof(int) for proc_dobool() > - don't warn when sysctl_ns_modules_allowed == 1 > > v3: > - drop capable(CAP_SYS_MODULE) check > - add a new CVE to changelog :-) > - add survey of distros that enable unpriv userns to changelog > > Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1 > Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2 > Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> > Cc: Serge Hallyn <serge@hallyn.com> > Cc: Eric Biederman <ebiederm@xmission.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: linux-hardening@vger.kernel.org > Cc: linux-modules@vger.kernel.org > Cc: John Haxby <john.haxby@oracle.com> > Cc: Jann Horn <jannh@google.com> > Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> > --- > Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++ > init/Kconfig | 17 ++++++++++ > kernel/kmod.c | 37 +++++++++++++++++++++ > 3 files changed, 65 insertions(+) > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst > index 46e3d62c0eea..bd9b4e911a6a 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -612,6 +612,17 @@ A value of 0 means no change. The default value is 200 meaning the NMI > watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10). > > > +ns_modules_allowed > +================== > + > +Control whether processes may trigger module loading inside a user namespace. > + > += ================================= > +0 Deny module loading requests. > +1 Accept module loading requests. > += ================================= > + > + > numa_balancing > ============== > > diff --git a/init/Kconfig b/init/Kconfig > index 7e5c3ddc341d..5d9ab43a24b9 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1249,6 +1249,23 @@ config USER_NS > > If unsure, say N. > > +config NS_MODULES_ALLOWED_DEFAULT_ON > + bool "Allow user namespaces to auto-load kernel modules by default" > + depends on MODULES > + depends on USER_NS > + default y > + help > + This option makes it so that processes running inside user > + namespaces may auto-load kernel modules. > + > + Say N to mitigate some exploits that rely on being able to > + auto-load kernel modules; however, this may also cause some > + legitimate programs to fail unless kernel modules are loaded by > + hand. > + > + You can write 0 or 1 to /proc/sys/kernel/ns_modules_allowed to > + change behaviour at run-time. > + > config PID_NS > bool "PID Namespaces" > default y > diff --git a/kernel/kmod.c b/kernel/kmod.c > index b717134ebe17..938c0a39381a 100644 > --- a/kernel/kmod.c > +++ b/kernel/kmod.c > @@ -25,6 +25,7 @@ > #include <linux/ptrace.h> > #include <linux/async.h> > #include <linux/uaccess.h> > +#include <linux/sysctl.h> > > #include <trace/events/module.h> > > @@ -105,6 +106,12 @@ static int call_modprobe(char *module_name, int wait) > return -ENOMEM; > } > > +/* > + * Allow processes running inside namespaces to trigger module loading? > + */ > +static bool sysctl_ns_modules_allowed __read_mostly = > + IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON); > + > /** > * __request_module - try to load a kernel module > * @wait: wait (or not) for the operation to complete > @@ -148,6 +155,16 @@ int __request_module(bool wait, const char *fmt, ...) > if (ret) > return ret; > > + /* > + * Disallow module loading if we're in a user namespace. > + */ > + if (current_user_ns() != &init_user_ns && > + !sysctl_ns_modules_allowed) { > + pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n", > + task_pid_nr(current), current->comm, module_name); > + return -EPERM; > + } > + > if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) { > pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...", > atomic_read(&kmod_concurrent_max), > @@ -175,3 +192,23 @@ int __request_module(bool wait, const char *fmt, ...) > return ret; > } > EXPORT_SYMBOL(__request_module); > + > +#ifdef CONFIG_SYSCTL > +static struct ctl_table kmod_sysctl_table[] = { > + { > + .procname = "ns_modules_allowed", > + .data = &sysctl_ns_modules_allowed, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dobool, > + }, > + { } > +}; > + > +static int __init kmod_sysctl_init(void) > +{ > + register_sysctl_init("kernel", kmod_sysctl_table); > + return 0; > +} > +late_initcall(kmod_sysctl_init); > +#endif > -- > 2.35.1.46.g38062e73e0
On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: > Creating a new user namespace grants you the ability to reach a lot of code > (including loading certain kernel modules) that would otherwise be out of > reach of an attacker. We can reduce the attack surface and block exploits > by ensuring that user namespaces cannot trigger module (auto-)loading. > > A cursory search of exploits found online yields the following > non-exhaustive list of vulnerabilities, and shows that the technique is > both old and still in use: > > - CVE-2016-8655 > - CVE-2017-1000112 > - CVE-2021-32606 > - CVE-2022-2588 > - CVE-2022-27666 > - CVE-2022-34918 > - CVE-2023-0179 > > A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS > Stream, and Oracle Linux allow unprivileged user namespaces by default, > probably to support sandboxing in browsers and containers. Major > exceptions would be Debian and Arch Linux which carry an out-of-tree patch > to disable user namespaces by default. > > This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to > 0 will block requests to load modules when the request originates in a > process running in a user namespace. > > For backwards compatibility, the default value of the sysctl is set to > CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning > there should be absolutely no change in behaviour unless you opt in either > at compile time or at runtime. > > This mitigation obviously offers no protection if the vulnerable module is > already loaded, but for many of these exploits the vast majority of users > will never actually load or use these modules on purpose; in other words, > for the vast majority of users, this would block exploits for the above > list of vulnerabilities. > > Testing: Running the reproducer for CVE-2022-2588 fails and results in the > following message in the kernel log: > > [ 130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl > > v2: > - fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init() > - use .maxlen = sizeof(int) for proc_dobool() > - don't warn when sysctl_ns_modules_allowed == 1 > > v3: > - drop capable(CAP_SYS_MODULE) check > - add a new CVE to changelog :-) > - add survey of distros that enable unpriv userns to changelog > > Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1 > Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2 > Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> > Cc: Serge Hallyn <serge@hallyn.com> > Cc: Eric Biederman <ebiederm@xmission.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: linux-hardening@vger.kernel.org > Cc: linux-modules@vger.kernel.org > Cc: John Haxby <john.haxby@oracle.com> > Cc: Jann Horn <jannh@google.com> > Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> > --- > Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++ > init/Kconfig | 17 ++++++++++ > kernel/kmod.c | 37 +++++++++++++++++++++ > 3 files changed, 65 insertions(+) > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst > index 46e3d62c0eea..bd9b4e911a6a 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -612,6 +612,17 @@ A value of 0 means no change. The default value is 200 meaning the NMI > watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10). > > > +ns_modules_allowed > +================== > + > +Control whether processes may trigger module loading inside a user namespace. This is false documentation. The place it is trying to protect simply prevents trying to call modprobe for auto-loading within the kernel. > diff --git a/kernel/kmod.c b/kernel/kmod.c > index b717134ebe17..938c0a39381a 100644 > --- a/kernel/kmod.c > +++ b/kernel/kmod.c > @@ -25,6 +25,7 @@ > #include <linux/ptrace.h> > #include <linux/async.h> > #include <linux/uaccess.h> > +#include <linux/sysctl.h> > > #include <trace/events/module.h> > > @@ -105,6 +106,12 @@ static int call_modprobe(char *module_name, int wait) > return -ENOMEM; > } > > +/* > + * Allow processes running inside namespaces to trigger module loading? > + */ > +static bool sysctl_ns_modules_allowed __read_mostly = > + IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON); > + > /** > * __request_module - try to load a kernel module > * @wait: wait (or not) for the operation to complete > @@ -148,6 +155,16 @@ int __request_module(bool wait, const char *fmt, ...) > if (ret) > return ret; > > + /* > + * Disallow module loading if we're in a user namespace. > + */ > + if (current_user_ns() != &init_user_ns && > + !sysctl_ns_modules_allowed) { > + pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n", > + task_pid_nr(current), current->comm, module_name); > + return -EPERM; > + } > + > if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) { > pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...", > atomic_read(&kmod_concurrent_max), Have you seen what call_modprobe() does? This is just a limitting the auto-loading through calling modprobe. If the concern is to load modules wouldn't you be better off just putting a stop gap at finit_module() which actually receives the load attempt from modprobe? Ie, an evil namespace, if it has access to /sbin/modprobe could simply just try calling /sbin/modprobe on its own. Beating the royal shit out of kmod is already stress tested via tools/testing/selftests/kmod/kmod.sh in particular: tools/testing/selftests/kmod/kmod.sh -t 0008 tools/testing/selftests/kmod/kmod.sh -t 0009 What this *could* do is race to force a failure on some other *real* modprobe request we do wish to honor when the above kmod kmod_concurrent_max is triggered. So in terms of justification, this commit log needs a bit more work as I just can't see how this alone is fixing any CVE. Note: actually *racing* tons of requests at the same time could end up then *in kernel requests* being processed, and there lies another shit show which we've recently been nose diving into through the work and report by Petr Pavlu [0]. Although I have that patch in modules-next, it was too soon to send to Linus for v6.2-rc1 due to lack of testing on linux-next. You may want to test that too, as its an old regression. Even though the commit log describe honest requests for modules, surely an exploit bad namespace could in theory try to race other requests too to prefer its own frequency module, etc. So let's take a step back and think this through. What exaclty and why would this commit fix *any* security issue? Itemizing CVEs won't cut it. [0] https://lkml.kernel.org/r/20221205103557.18363-1-petr.pavlu@suse.com Luis
On 1/12/23 19:45, Luis Chamberlain wrote: > On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: >> >> +ns_modules_allowed >> +================== >> + >> +Control whether processes may trigger module loading inside a user namespace. > > This is false documentation. The place it is trying to protect simply > prevents trying to call modprobe for auto-loading within the kernel. I don't think this is false -- but yes, this only protects module auto-loading in user namespaces; all auto-loading calls within the kernel should be going through this __request_module() -> modprobe path. init_module()/finit_module(), the mechanism used by modprobe, are themselves already restricted inside user namespaces, see below. >> + /* >> + * Disallow module loading if we're in a user namespace. >> + */ >> + if (current_user_ns() != &init_user_ns && >> + !sysctl_ns_modules_allowed) { >> + pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n", >> + task_pid_nr(current), current->comm, module_name); >> + return -EPERM; >> + } >> + >> if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) { >> pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...", >> atomic_read(&kmod_concurrent_max), > > Have you seen what call_modprobe() does? Yes. > This is just a limitting the auto-loading through calling modprobe. > If the concern is to load modules wouldn't you be better off just > putting a stop gap at finit_module() which actually receives the > load attempt from modprobe? Ie, an evil namespace, if it has access > to /sbin/modprobe could simply just try calling /sbin/modprobe on its > own. No. Root inside a user namespace can't call finit_module() as it won't have the necessary capabilities in the init namespace, see may_init_module(). modprobe, on the other hand, when called by the kernel, is called through usermode helper, which runs in the init namespace as root, so it can do whatever it wants. If modprobe called by root inside a user namespace could load anything, that itself would be a security issue. But it can't, so it's not. > Beating the royal shit out of kmod is already stress tested via > tools/testing/selftests/kmod/kmod.sh in particular: > > tools/testing/selftests/kmod/kmod.sh -t 0008 > tools/testing/selftests/kmod/kmod.sh -t 0009 > > What this *could* do is race to force a failure on some other *real* > modprobe request we do wish to honor when the above kmod kmod_concurrent_max > is triggered. How? My new check is done before the kmod_concurrent_max check/critical section... the check doesn't cause any more modprobe requests to happen in the first place, the only thing it can do is make them exit early. There is no way my patch can make this worse. > So in terms of justification, this commit log needs a bit more work as I > just can't see how this alone is fixing any CVE. [...] > So let's take a step back and think this through. What exaclty and why > would this commit fix *any* security issue? Itemizing CVEs won't cut it. I can include my explanations above in the changelog if you think that will help. Vegard
On Thu, Jan 12, 2023 at 10:53:07PM +0100, Vegard Nossum wrote: > On 1/12/23 19:45, Luis Chamberlain wrote: > > On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: > > > +ns_modules_allowed > > > +================== > > > + > > > +Control whether processes may trigger module loading inside a user namespace. > > > > This is false documentation. The place it is trying to protect simply > > prevents trying to call modprobe for auto-loading within the kernel. > > I don't think this is false -- but yes, this only protects module > auto-loading in user namespaces; all auto-loading calls within the > kernel should be going through this __request_module() -> modprobe path. > > init_module()/finit_module(), the mechanism used by modprobe, are > themselves already restricted inside user namespaces, see below. The documentation should be clear about the exact nature of what mechanism is prevented to load. > > > + /* > > > + * Disallow module loading if we're in a user namespace. > > > + */ > > > + if (current_user_ns() != &init_user_ns && > > > + !sysctl_ns_modules_allowed) { > > > + pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n", > > > + task_pid_nr(current), current->comm, module_name); > > > + return -EPERM; > > > + } > > > + > > > if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) { > > > pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...", > > > atomic_read(&kmod_concurrent_max), > > > > Have you seen what call_modprobe() does? > > Yes. > > > This is just a limitting the auto-loading through calling modprobe. > > If the concern is to load modules wouldn't you be better off just > > putting a stop gap at finit_module() which actually receives the > > load attempt from modprobe? Ie, an evil namespace, if it has access > > to /sbin/modprobe could simply just try calling /sbin/modprobe on its > > own. > > No. > > Root inside a user namespace can't call finit_module() as it won't have > the necessary capabilities in the init namespace, see may_init_module(). I think the documentation to this knob you are adding should explain this as well to give a bette context as to why it is useful. And if may_init_module() suffices, why not just add a may_init_module() instead of your check? Why would we allow a successful call to modprobe with your sysctl but end up having it ignored in the end by finit_module()? If what is being allowed here is to overcome may_init_module() CAP_SYS_MODULE check by using call_modprobe() on the user namespace the commit log should just mention this, and mention that by design user namespaces have never been intended to allow loading modules, even if they somehow end up triggering auto-loading of modules via request_module(). If this is true then currently we are enabling auto-loading as a mistake, and the sysctl is only valuable to prevent regressions with existing behaviour which should have been disabled from the start. Is there an example successful exploit which takes avantage of having a user namespace auto-load a kernel module and that triggers a security flaw? What's an example of one? > > What this *could* do is race to force a failure on some other *real* > > modprobe request we do wish to honor when the above kmod kmod_concurrent_max > > is triggered. > > How? My new check is done before the kmod_concurrent_max check/critical > section... the check doesn't cause any more modprobe requests to happen > in the first place, the only thing it can do is make them exit early. > There is no way my patch can make this worse. You are missing my point. My point is that abuse in general over having a user namespace calling modprobe could cause harm that it can possibly Denial-of-Service valid requests. If user namespaces should never have been allowed to even auto-load the commit log should mention that. > > So in terms of justification, this commit log needs a bit more work as I > > just can't see how this alone is fixing any CVE. > > [...] > > > So let's take a step back and think this through. What exaclty and why > > would this commit fix *any* security issue? Itemizing CVEs won't cut it. > > I can include my explanations above in the changelog if you think that > will help. Answering the above questions would help. Luis
On 1/12/23 19:00, Serge E. Hallyn wrote: > On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: >> Creating a new user namespace grants you the ability to reach a lot of code >> (including loading certain kernel modules) that would otherwise be out of >> reach of an attacker. We can reduce the attack surface and block exploits >> by ensuring that user namespaces cannot trigger module (auto-)loading. >> >> A cursory search of exploits found online yields the following >> non-exhaustive list of vulnerabilities, and shows that the technique is >> both old and still in use: >> >> - CVE-2016-8655 >> - CVE-2017-1000112 >> - CVE-2021-32606 >> - CVE-2022-2588 >> - CVE-2022-27666 >> - CVE-2022-34918 >> - CVE-2023-0179 > > I think it would be worth pointing out how many of the above would > actually be aided by this patch. The first two would not, but certainly > at least the can module one counts. So I support this at least in > principle. I'll take a closer look at the code hopefully tonight. The intention was to list _only_ CVEs with exploits that would be mitigated by this patch. Let me go through them one by one, just using public exploits found with Google: CVE-2016-8655: this uses AF_PACKET. I guess your objection is that AF_PACKET is rarely built as a module and even then would most certainly already be loaded as part of regular operations? I see at least one distro kernel having used CONFIG_PACKET=m in the past, so I wouldn't write this off completely. You need CAP_NET_RAW to create this socket type AFAICT, which is why the exploit uses user/network namespaces. CVE-2017-1000112: uses AF_INET. Agreed that this would certainly be compiled in or already loaded, so we can drop this. CVE-2021-32606: needs CONFIG_CAN_ISOTP=m/can-isotp. CVE-2022-2588: needs CONFIG_NET_CLS_ROUTE4=m/cls_route. CVE-2022-27666: needs CONFIG_INET6_ESP=m/esp6. CVE-2022-34918: needs CONFIG_NF_TABLES=m/nf_tables at least. CVE-2023-0179: needs CONFIG_NF_TABLES=m/nf_tables at least. All the exploits seem to be using user namespaces, for CVE-2017-1000112 I think it needs it to set the MTU of a dummy interface. I'm happy to drop this CVE from the list (I probably looked too fast when looking at it), but I think the rest are legitimate. Added Andrey Konovalov to Cc as he wrote the exploit I looked at and can maybe confirm (and in general has more experience with exploits). Thanks again for looking at this -- I will try to address Luis's comments in this thread as well and send out a v4 when we agree on the required changes. Vegard
On 3/1/23 12:09, Vegard Nossum wrote: > > On 1/12/23 19:00, Serge E. Hallyn wrote: >> On Thu, Jan 12, 2023 at 02:19:11PM +0100, Vegard Nossum wrote: >>> Creating a new user namespace grants you the ability to reach a lot >>> of code >>> (including loading certain kernel modules) that would otherwise be >>> out of >>> reach of an attacker. We can reduce the attack surface and block >>> exploits >>> by ensuring that user namespaces cannot trigger module (auto-)loading. >>> >>> A cursory search of exploits found online yields the following >>> non-exhaustive list of vulnerabilities, and shows that the technique is >>> both old and still in use: >>> >>> - CVE-2016-8655 >>> - CVE-2017-1000112 >>> - CVE-2021-32606 >>> - CVE-2022-2588 >>> - CVE-2022-27666 >>> - CVE-2022-34918 >>> - CVE-2023-0179 >> >> I think it would be worth pointing out how many of the above would >> actually be aided by this patch. The first two would not, but certainly >> at least the can module one counts. So I support this at least in >> principle. I'll take a closer look at the code hopefully tonight. > > The intention was to list _only_ CVEs with exploits that would be > mitigated by this patch. Let me go through them one by one, just using > public exploits found with Google: > > CVE-2016-8655: this uses AF_PACKET. I guess your objection is that > AF_PACKET is rarely built as a module and even then would most certainly > already be loaded as part of regular operations? I see at least one > distro kernel having used CONFIG_PACKET=m in the past, so I wouldn't > write this off completely. You need CAP_NET_RAW to create this socket > type AFAICT, which is why the exploit uses user/network namespaces. > > CVE-2017-1000112: uses AF_INET. Agreed that this would certainly be > compiled in or already loaded, so we can drop this. [...] > All the exploits seem to be using user namespaces, for CVE-2017-1000112 > I think it needs it to set the MTU of a dummy interface. I'm happy to > drop this CVE from the list (I probably looked too fast when looking at > it), but I think the rest are legitimate. Added Andrey Konovalov to Cc > as he wrote the exploit I looked at and can maybe confirm (and in > general has more experience with exploits). Oh, I see -- the request_module() call in socket() happens before the CAP_NET_RAW check in e.g. packet_create(), which means you don't need user namespaces to _load_ these protocols, you can just do the socket() call in the init user namespace, which will load the module but fail creating the socket, then call socket() again inside the user namespace, which now succeeds because the module has been loaded... I haven't looked at the other CVEs from this angle, I'll look over them again and see what I can find. Vegard
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 46e3d62c0eea..bd9b4e911a6a 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -612,6 +612,17 @@ A value of 0 means no change. The default value is 200 meaning the NMI watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10). +ns_modules_allowed +================== + +Control whether processes may trigger module loading inside a user namespace. + += ================================= +0 Deny module loading requests. +1 Accept module loading requests. += ================================= + + numa_balancing ============== diff --git a/init/Kconfig b/init/Kconfig index 7e5c3ddc341d..5d9ab43a24b9 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1249,6 +1249,23 @@ config USER_NS If unsure, say N. +config NS_MODULES_ALLOWED_DEFAULT_ON + bool "Allow user namespaces to auto-load kernel modules by default" + depends on MODULES + depends on USER_NS + default y + help + This option makes it so that processes running inside user + namespaces may auto-load kernel modules. + + Say N to mitigate some exploits that rely on being able to + auto-load kernel modules; however, this may also cause some + legitimate programs to fail unless kernel modules are loaded by + hand. + + You can write 0 or 1 to /proc/sys/kernel/ns_modules_allowed to + change behaviour at run-time. + config PID_NS bool "PID Namespaces" default y diff --git a/kernel/kmod.c b/kernel/kmod.c index b717134ebe17..938c0a39381a 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -25,6 +25,7 @@ #include <linux/ptrace.h> #include <linux/async.h> #include <linux/uaccess.h> +#include <linux/sysctl.h> #include <trace/events/module.h> @@ -105,6 +106,12 @@ static int call_modprobe(char *module_name, int wait) return -ENOMEM; } +/* + * Allow processes running inside namespaces to trigger module loading? + */ +static bool sysctl_ns_modules_allowed __read_mostly = + IS_BUILTIN(CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON); + /** * __request_module - try to load a kernel module * @wait: wait (or not) for the operation to complete @@ -148,6 +155,16 @@ int __request_module(bool wait, const char *fmt, ...) if (ret) return ret; + /* + * Disallow module loading if we're in a user namespace. + */ + if (current_user_ns() != &init_user_ns && + !sysctl_ns_modules_allowed) { + pr_warn_ratelimited("request_module: pid %d (%s) in user namespace requested kernel module %s; denied due to kernel.ns_modules_allowed sysctl\n", + task_pid_nr(current), current->comm, module_name); + return -EPERM; + } + if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) { pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...", atomic_read(&kmod_concurrent_max), @@ -175,3 +192,23 @@ int __request_module(bool wait, const char *fmt, ...) return ret; } EXPORT_SYMBOL(__request_module); + +#ifdef CONFIG_SYSCTL +static struct ctl_table kmod_sysctl_table[] = { + { + .procname = "ns_modules_allowed", + .data = &sysctl_ns_modules_allowed, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dobool, + }, + { } +}; + +static int __init kmod_sysctl_init(void) +{ + register_sysctl_init("kernel", kmod_sysctl_table); + return 0; +} +late_initcall(kmod_sysctl_init); +#endif
Creating a new user namespace grants you the ability to reach a lot of code (including loading certain kernel modules) that would otherwise be out of reach of an attacker. We can reduce the attack surface and block exploits by ensuring that user namespaces cannot trigger module (auto-)loading. A cursory search of exploits found online yields the following non-exhaustive list of vulnerabilities, and shows that the technique is both old and still in use: - CVE-2016-8655 - CVE-2017-1000112 - CVE-2021-32606 - CVE-2022-2588 - CVE-2022-27666 - CVE-2022-34918 - CVE-2023-0179 A quick survey of common distros shows that Ubuntu, Fedora, RHEL, CentOS Stream, and Oracle Linux allow unprivileged user namespaces by default, probably to support sandboxing in browsers and containers. Major exceptions would be Debian and Arch Linux which carry an out-of-tree patch to disable user namespaces by default. This patch adds a new sysctl, kernel.ns_modules_allowed, which when set to 0 will block requests to load modules when the request originates in a process running in a user namespace. For backwards compatibility, the default value of the sysctl is set to CONFIG_NS_MODULES_ALLOWED_DEFAULT_ON, which in turn defaults to 1, meaning there should be absolutely no change in behaviour unless you opt in either at compile time or at runtime. This mitigation obviously offers no protection if the vulnerable module is already loaded, but for many of these exploits the vast majority of users will never actually load or use these modules on purpose; in other words, for the vast majority of users, this would block exploits for the above list of vulnerabilities. Testing: Running the reproducer for CVE-2022-2588 fails and results in the following message in the kernel log: [ 130.208030] request_module: pid 4107 (a.out) requested kernel module rtnl-link-dummy; denied due to kernel.ns_modules_allowed sysctl v2: - fix build failure due to missing CONFIG_SYSCTL guard around register_sysctl_init() - use .maxlen = sizeof(int) for proc_dobool() - don't warn when sysctl_ns_modules_allowed == 1 v3: - drop capable(CAP_SYS_MODULE) check - add a new CVE to changelog :-) - add survey of distros that enable unpriv userns to changelog Link: https://lore.kernel.org/lkml/20220809185229.28417-1-vegard.nossum@oracle.com/ # v1 Link: https://lore.kernel.org/lkml/20220815082753.6088-1-vegard.nossum@oracle.com/ # v2 Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-hardening@vger.kernel.org Cc: linux-modules@vger.kernel.org Cc: John Haxby <john.haxby@oracle.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> --- Documentation/admin-guide/sysctl/kernel.rst | 11 ++++++ init/Kconfig | 17 ++++++++++ kernel/kmod.c | 37 +++++++++++++++++++++ 3 files changed, 65 insertions(+)