diff mbox

[v5,next,5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

Message ID 1511803118-2552-6-git-send-email-tixxdz@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Djalal Harouni Nov. 27, 2017, 5:18 p.m. UTC
This uses the new request_module_cap() facility to directly propagate
CAP_NET_ADMIN capability and the 'netdev' module prefix to the
capability subsystem as it was suggested.

We do not remove the explicit capable(CAP_NET_ADMIN) check here, but we
may remove it in future versions since it is also performed by the
capability subsystem. This allows to have a better interface where other
subsystems will just use this call and let the capability subsystem
handles the permission checks, if the modules should be loaded or not.

This is also an infrastructure fix since historically Linux always
allowed to auto-load modules without privileges, and later the net code
started to check capabilities and prefixes, adapted the CAP_NET_ADMIN
check with the 'netdev' prefix to prevent abusing the capability by
loading non-netdev modules. However from a bigger picture we want to
continue to support automatic module loading as non privileged but also
implement easy policy solutions like:

User=djalal
DenyNewFeatures=no

Which will translate to allow the interactive user djalal to load extra
Linux features. Others, volatile accounts or guests can be easily
blocked from doing so. We have introduced in previous patches the
necessary infrastructure and now with this change we start to use the
new request_module_cap() function to explicitly tell the capability
subsystem that we want to auto-load modules with CAP_NET_ADMIN if they
are prefixed.

This is also based on suggestions from Rusty Russel and Kees Cook [1]

[1] https://lkml.org/lkml/2017/4/26/735

Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Andy Lutomirski <luto@kernel.org>
Suggested-by: Rusty Russell <rusty@rustcorp.com.au>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
---
 net/core/dev_ioctl.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Nov. 27, 2017, 6:44 p.m. UTC | #1
On Mon, Nov 27, 2017 at 9:18 AM, Djalal Harouni <tixxdz@gmail.com> wrote:
> This uses the new request_module_cap() facility to directly propagate
> CAP_NET_ADMIN capability and the 'netdev' module prefix to the
> capability subsystem as it was suggested.

This is the kind of complexity that I wonder if it's worth it at all.

Nobody sane actually uses those stupid capability bits. Have you ever
actually seen it used in real life?

They were a mistake, and we should never have done them - another case
of security people who think that complexity == security, when in
reality nobody actually wants the complexity or is willing to set it
up and manage it.

                   Linus
Djalal Harouni Nov. 27, 2017, 9:41 p.m. UTC | #2
Hi Linus,

On Mon, Nov 27, 2017 at 7:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 27, 2017 at 9:18 AM, Djalal Harouni <tixxdz@gmail.com> wrote:
>> This uses the new request_module_cap() facility to directly propagate
>> CAP_NET_ADMIN capability and the 'netdev' module prefix to the
>> capability subsystem as it was suggested.
>
> This is the kind of complexity that I wonder if it's worth it at all.
>
> Nobody sane actually uses those stupid capability bits. Have you ever
> actually seen it used in real life?

Yes they are complicated even for developers, and normal users do not
understand them, however yes every sandbox and container is exposing
them to endusers directly, they are documented!  so yes CAP_SYS_MODULE
is exposed but it does not cover autoloading.

However, we are trying hard to abstract some semantics that are easy
to grasp, we are mutating capabilities and seccomp to have an
abstracted "yes/no" options for our endusers.

Now, if you are referring to kernel code, the networking subsystem is
using them and I don't want to break any assumption here. There is
still the request_module(), the request_module_cap() was suggested so
networking code later won't have to do the checks on its own, and
maybe it can be consistent in the long term. The phonet sockets even
needs CAP_SYS_ADMIN...


>
> They were a mistake, and we should never have done them - another case
> of security people who think that complexity == security, when in
> reality nobody actually wants the complexity or is willing to set it
> up and manage it.

Alright, but I guess we are stuck, is there something better on how we
can do this or describe this ?


Please note in these patches, the mode is specifically described as:

* allowed: for backward compatibility  (I would have done without it)
* privileged: which includes capabilities (backward compatibility too)
or we can add what ever in the future
* disabled: even for privileged.

So I would have preferred if it is something like "yes/no" but...
However in userspace we will try hard to hide this complexity and the
capability bits.

Now I can see that the code comments and doc refer to privileged with
capabilities a lot, where we can maybe update that doc and code to
less state that privileged means capabilities ? Suggestions ?

Thanks!

>                    Linus
Linus Torvalds Nov. 27, 2017, 10:04 p.m. UTC | #3
On Mon, Nov 27, 2017 at 1:41 PM, Djalal Harouni <tixxdz@gmail.com> wrote:
>
> However, we are trying hard to abstract some semantics that are easy
> to grasp, we are mutating capabilities and seccomp to have an
> abstracted "yes/no" options for our endusers.

Yes.

Sadly, it looks like we actually do have users that just expect to
load modules dynamically without any capabilities at all.

So we can't actually disallow it by default at all, which imho makes
this security option essentially useless.

A security option that people can't use without breaking their system
is pointless.

We saw that with SELinux - people ended up just disabling it for
_years_, simply because it ended up breaking so much in practice. And
yes, it got fixed eventually, but at an incredibly high maintenance
cost of all the crazy rules databases.

> Alright, but I guess we are stuck, is there something better on how we
> can do this or describe this ?

So I wonder if we can perhaps look at the places that actually do
"requerst_module()", and start filtering them on that basis.

Some of them will already have checked for capabilities.

Others clearly expect to juist work even _without_ capabilities (ie
the bluetoothd case).

So the whole "let's add a global config option" model is broken. There
is no possible global rule. It will break things, which in turn mean
that people won't turn it on (and we can't turn it on by default),
which in turn makes this pointless.

In other words, I really think that anything that just adds a mode
flag cannot work.

So instead of having one "modules_autoload_mode" thing, maybe the
individual requerst_module() cases need to simply be audited.

Put another way: I think the part of your patch series that does that
"request_module_cap()" and makes the netdev modules use it is a good
addition.

It's the "mode" part I really don't agree with, because apparently we
really need to default it to permissive.

So how about instead:

 - add that "request_module_cap()" and make the networking code that
already uses CAP_ADMIN_NET use it.

 - make "request_module()" itself default to being
"request_module_cap(CAP_SYS_MODULE,..)"

 - make sure that when the capability check fails, we print an error
message, and then for the ones that trigger, we will audit them and
see if it's ok.

Because that "mode" flag defaulting to off will just mean that the
default case will remain the existing unsafe one, and that's bad.

Opt-in really doesn't work. We've done it.

Global flags for varied behavior really doesn't work. We've done that
too. Different cases want different behavior, the global flag is just
fundamentally broken.

              Linus
Kees Cook Nov. 27, 2017, 10:59 p.m. UTC | #4
On Mon, Nov 27, 2017 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 27, 2017 at 1:41 PM, Djalal Harouni <tixxdz@gmail.com> wrote:
>>
>> However, we are trying hard to abstract some semantics that are easy
>> to grasp, we are mutating capabilities and seccomp to have an
>> abstracted "yes/no" options for our endusers.
>
> Yes.
>
> Sadly, it looks like we actually do have users that just expect to
> load modules dynamically without any capabilities at all.
>
> So we can't actually disallow it by default at all, which imho makes
> this security option essentially useless.

The good news here is that this series comes with an expected user:
systemd, for which I suspect this will option will get wider and wider
use.

> A security option that people can't use without breaking their system
> is pointless.

Another bit of good news is that systems _depending_ cap-less module
loading are uncommon enough that many system builders can enable this
protection and nothing will break. (Even you thought it was rare
enough that we could just make this default-enabled.)

Besides, distros understand that they have to keep an eye out for
these things since the kernel has a long history of not allowing
default-enabled protections that change userspace behavior, etc.

> So I wonder if we can perhaps look at the places that actually do
> "requerst_module()", and start filtering them on that basis.
>
> Some of them will already have checked for capabilities.
>
> Others clearly expect to just work even _without_ capabilities (ie
> the bluetoothd case).
>
> So the whole "let's add a global config option" model is broken. There
> is no possible global rule. It will break things, which in turn mean
> that people won't turn it on (and we can't turn it on by default),
> which in turn makes this pointless.
>
> In other words, I really think that anything that just adds a mode
> flag cannot work.
>
> So instead of having one "modules_autoload_mode" thing, maybe the
> individual requerst_module() cases need to simply be audited.
>
> Put another way: I think the part of your patch series that does that
> "request_module_cap()" and makes the netdev modules use it is a good
> addition.
>
> It's the "mode" part I really don't agree with, because apparently we
> really need to default it to permissive.
>
> So how about instead:
>
>  - add that "request_module_cap()" and make the networking code that
> already uses CAP_ADMIN_NET use it.
>
>  - make "request_module()" itself default to being
> "request_module_cap(CAP_SYS_MODULE,..)"
>
>  - make sure that when the capability check fails, we print an error
> message, and then for the ones that trigger, we will audit them and
> see if it's ok.

The issue isn't "is it always okay to cap-less-load this module?" it's
"should the kernel's attack surface be allowed to grown by an
unprivileged user?" Nearly all the request_module() sites have already
been audited and adjusted to avoid _arbitrary_ module loading (usually
by adding module prefixes: netdev-, crypto-, etc), and that greatly
helped reduce the ability for a unprivileged user to load a
known-buggy module, but there will remain many subsystems that
continue to expect to grow the kernel's features on demand within
their subsystem, and sometimes those modules will have bugs that an
unprivileged user can exploit (this history of this happening is quite
real; it's not a theoretical concern). This is especially problematic
for distro kernels where they build tons of modules.

There isn't a way for an admin to say "I only want root to be able to
load modules". They can't delete all the "unused" kernel modules,
because maybe root will want the module, they can't change module file
permissions because the modules are read with init's permissions (and
it would be terrible to repeatedly change all the file perms each time
a new kernel got installed), modules can only be blacklisted (not
whitelisted) in /etc/modprobe.d/, etc.

> Because that "mode" flag defaulting to off will just mean that the
> default case will remain the existing unsafe one, and that's bad.
>
> Opt-in really doesn't work. We've done it.
>
> Global flags for varied behavior really doesn't work. We've done that
> too. Different cases want different behavior, the global flag is just
> fundamentally broken.

I don't disagree that a global should be avoided, but I'm struggling
to see another option here. We can't break userspace by default so we
can't restrict cap-less loading by default. But we can allow userspace
to _choose_ to break itself, especially within a container. This isn't
uncommon, especially for modules, where we even have the global
"modules_disabled" sysctl already. The level of granularity of control
here is the issue, and it's what this series solves.

The options I see for module loading control are:
1) monolithic kernel (no modules)
2) modular kernel that flips on modules_disabled after boot (no
modules after boot)
3) modular kernel that allows per-subsystem unpriv module loading (all
modules loadable)

There is a demand for something between 2 and 3 where only root can
load modules. (And as pointed out in the series, this is _especially_
true for containers where the admin may want to leave module loading
alone in the init namespace, but stop any module loading in the
container.)

-Kees
Linus Torvalds Nov. 27, 2017, 11:14 p.m. UTC | #5
On Mon, Nov 27, 2017 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
>
> I don't disagree that a global should be avoided, but I'm struggling
> to see another option here. We can't break userspace by default so we
> can't restrict cap-less loading by default. But we can allow userspace
> to _choose_ to break itself, especially within a container. This isn't
> uncommon, especially for modules, where we even have the global
> "modules_disabled" sysctl already. The level of granularity of control
> here is the issue, and it's what this series solves.

So there's two "global" here

 - if a container were to choose to break itself, it should damn well
be container-specific, not some global option

   This part seems to be ok in the patch series, since the "global" is
really per-task. So it's not global in the "system-wide" sense.

 - if _one_ request_module() caller were to say "I don't want to be
loaded by a normal user", that doesn't mean that _other_
request_module() cases shouldn't.

   This is the part I'm objecting to, because it means that we can't
enable this stricter policy by default.

And the thing is, the patch series seems to already introduce largely
the better model of just making it site-specific. Introducing that
request_module_cap() thing and then using it for networking is a good
step.

But I also suspect that we _could_ just make the stricter rules
actually be default, if we just fixed the thing up to not be "every
request_module() is the same".

For example, several request_module() calls come from device node
opens, and it makes sense that we can just say: "if you have access to
the device node, then you have the right to request the module".

But that would need to be not a global "request_module()" behavior,
but a behavior that is tied to the particular call-site.

IOW, extend on that request_module_cap() model, and introduce
(perhaps) a "request_module_dev()" call that basically means "the user
opened the device node for the requested module".

Because those kinds of permissions aren't necessarily about
capabilities, but about things like "I'm in the dialout group, I get
to open tty devices and by implication request their modules".

And that _really_ isn't global behavior.  The fact that I might be
able to load a serial; module has *nothing* to do with whether I can
load some other kind of module at all.

That global mode is just wrong.

                Linus
Kees Cook Nov. 27, 2017, 11:19 p.m. UTC | #6
On Mon, Nov 27, 2017 at 3:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 27, 2017 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> I don't disagree that a global should be avoided, but I'm struggling
>> to see another option here. We can't break userspace by default so we
>> can't restrict cap-less loading by default. But we can allow userspace
>> to _choose_ to break itself, especially within a container. This isn't
>> uncommon, especially for modules, where we even have the global
>> "modules_disabled" sysctl already. The level of granularity of control
>> here is the issue, and it's what this series solves.
>
> So there's two "global" here
>
>  - if a container were to choose to break itself, it should damn well
> be container-specific, not some global option
>
>    This part seems to be ok in the patch series, since the "global" is
> really per-task. So it's not global in the "system-wide" sense.
>
>  - if _one_ request_module() caller were to say "I don't want to be
> loaded by a normal user", that doesn't mean that _other_
> request_module() cases shouldn't.
>
>    This is the part I'm objecting to, because it means that we can't
> enable this stricter policy by default.
>
> And the thing is, the patch series seems to already introduce largely
> the better model of just making it site-specific. Introducing that
> request_module_cap() thing and then using it for networking is a good
> step.
>
> But I also suspect that we _could_ just make the stricter rules
> actually be default, if we just fixed the thing up to not be "every
> request_module() is the same".
>
> For example, several request_module() calls come from device node
> opens, and it makes sense that we can just say: "if you have access to
> the device node, then you have the right to request the module".
>
> But that would need to be not a global "request_module()" behavior,
> but a behavior that is tied to the particular call-site.
>
> IOW, extend on that request_module_cap() model, and introduce
> (perhaps) a "request_module_dev()" call that basically means "the user
> opened the device node for the requested module".
>
> Because those kinds of permissions aren't necessarily about
> capabilities, but about things like "I'm in the dialout group, I get
> to open tty devices and by implication request their modules".
>
> And that _really_ isn't global behavior.  The fact that I might be
> able to load a serial; module has *nothing* to do with whether I can
> load some other kind of module at all.
>
> That global mode is just wrong.

What about exporting this entirely to userspace, giving it as much
context as possible? i.e. inform modprobe about the user doing it,
maybe the subsystem, etc?

-Kees
Linus Torvalds Nov. 27, 2017, 11:35 p.m. UTC | #7
On Mon, Nov 27, 2017 at 3:19 PM, Kees Cook <keescook@chromium.org> wrote:
>
> What about exporting this entirely to userspace, giving it as much
> context as possible? i.e. inform modprobe about the user doing it,
> maybe the subsystem, etc?

Yeah, except for the fact that we don't trust user-mode?

We used to do that exact thing. It was a nasty disaster, and caused
version skew and other horrible problems.

So no. Th e"let's just let user mode sort it out" doesn't work. User
mode doesn't sort anything out, it just makes it worse.

It's not some made-up example when I say that user-mode has decided
that kernel requests have to be completely serialized, and recusive
invocations will just hang.

So no. We do not go down that particular rat-hole. It's just a bigger
chance of getting things wrong.

                Linus
Kees Cook Nov. 28, 2017, 1:23 a.m. UTC | #8
On Mon, Nov 27, 2017 at 3:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 27, 2017 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> I don't disagree that a global should be avoided, but I'm struggling
>> to see another option here. We can't break userspace by default so we
>> can't restrict cap-less loading by default. But we can allow userspace
>> to _choose_ to break itself, especially within a container. This isn't
>> uncommon, especially for modules, where we even have the global
>> "modules_disabled" sysctl already. The level of granularity of control
>> here is the issue, and it's what this series solves.
>
> So there's two "global" here
>
>  - if a container were to choose to break itself, it should damn well
> be container-specific, not some global option
>
>    This part seems to be ok in the patch series, since the "global" is
> really per-task. So it's not global in the "system-wide" sense.

Right, though in the case of init, it could flip that toggle for
itself and it would then effectively be system-wide.

>  - if _one_ request_module() caller were to say "I don't want to be
> loaded by a normal user", that doesn't mean that _other_
> request_module() cases shouldn't.
>
>    This is the part I'm objecting to, because it means that we can't
> enable this stricter policy by default.

Okay, I see what you mean here. You want to clearly distinguish
between unprivileged and privileged-only. I'm unconvinced that's going
to change much, as the bulk of the exposed request_module() users are
already expecting to be unprivileged (and that's why they were all
converted to requiring a named prefix).

> And the thing is, the patch series seems to already introduce largely
> the better model of just making it site-specific. Introducing that
> request_module_cap() thing and then using it for networking is a good
> step.
>
> But I also suspect that we _could_ just make the stricter rules
> actually be default, if we just fixed the thing up to not be "every
> request_module() is the same".

When doing some of the older module name prefix work, I did consider
introducing a new request_module() API that included the prefix name
as an explicit argument (instead of embedding it in the format
string). We could easily start there, and then have "plain"
request_module() require privs. But we'll still need a way to say
"admin doesn't want unpriv module auto-loading".

> For example, several request_module() calls come from device node
> opens, and it makes sense that we can just say: "if you have access to
> the device node, then you have the right to request the module".

Many of these callers are using network interfaces to do this work, so
there isn't as clean a permission model associated with those like
there might be with a filesystem open(). But that doesn't matter (see
below).

> But that would need to be not a global "request_module()" behavior,
> but a behavior that is tied to the particular call-site.
>
> IOW, extend on that request_module_cap() model, and introduce
> (perhaps) a "request_module_dev()" call that basically means "the user
> opened the device node for the requested module".
>
> Because those kinds of permissions aren't necessarily about
> capabilities, but about things like "I'm in the dialout group, I get
> to open tty devices and by implication request their modules".

This really doesn't address the main concern that is the problem:
whitelisting vs blacklisting. In your example, the dialout group gives
access to specific ttys or serial ports, etc, but an admin may want a
way to make sure the users don't load some buggy line discipline. Now,
that admin could blacklist all those modules one at a time, but new
stuff might get introduced, it doesn't handle other subsystems, etc.

We need to provide a way to whitelist autoloaded modules. The
demonstrated need (to whitelist _no_ modules, addressed by this
series) provides that level of control on a task basis (effectively
making it container-specific).

> And that _really_ isn't global behavior.  The fact that I might be
> able to load a serial; module has *nothing* to do with whether I can
> load some other kind of module at all.
>
> That global mode is just wrong.

If the per-task thing stays and the global sysctl goes, that would be
fine by me. That still gives admins a way to control the autoload
behavior, assuming their init knows how to set the flag. The global
sysctl, in my mind, is really more of a way for an admin to do this
after the fact without rebooting, etc. But I don't have a strong
opinion about the global sysctl.

-Kees
Geo Kozey Nov. 28, 2017, 12:16 p.m. UTC | #9
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Mon Nov 27 23:04:58 CET 2017
> To: Djalal Harouni <tixxdz@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>, Andy Lutomirski <luto@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Luis R. Rodriguez <mcgrof@kernel.org>, James Morris <james.l.morris@oracle.com>, Ben Hutchings <ben.hutchings@codethink.co.uk>, Solar Designer <solar@openwall.com>, Serge Hallyn <serge@hallyn.com>, Jessica Yu <jeyu@kernel.org>, Rusty Russell <rusty@rustcorp.com.au>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, kernel-hardening@lists.openwall.com <kernel-hardening@lists.openwall.com>, Jonathan Corbet <corbet@lwn.net>, Ingo Molnar <mingo@kernel.org>, David S. Miller <davem@davemloft.net>, Network Development <netdev@vger.kernel.org>, Peter Zijlstra <peterz@infradead.org>
> Subject: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules
> 
> 
> On Mon, Nov 27, 2017 at 1:41 PM, Djalal Harouni <tixxdz@gmail.com> wrote:
> >
> > However, we are trying hard to abstract some semantics that are easy
> > to grasp, we are mutating capabilities and seccomp to have an
> > abstracted "yes/no" options for our endusers.
> 
> Yes.
> 
> Sadly, it looks like we actually do have users that just expect to
> load modules dynamically without any capabilities at all.
> 
> So we can't actually disallow it by default at all, which imho makes
> this security option essentially useless.
> 
> A security option that people can't use without breaking their system
> is pointless.
> 
> We saw that with SELinux - people ended up just disabling it for
> _years_, simply because it ended up breaking so much in practice. And
> yes, it got fixed eventually, but at an incredibly high maintenance
> cost of all the crazy rules databases.
> 
> > Alright, but I guess we are stuck, is there something better on how we
> > can do this or describe this ?
> 
> So I wonder if we can perhaps look at the places that actually do
> "requerst_module()", and start filtering them on that basis.
> 
> Some of them will already have checked for capabilities.
> 
> Others clearly expect to juist work even _without_ capabilities (ie
> the bluetoothd case).
> 
> So the whole "let's add a global config option" model is broken. There
> is no possible global rule. It will break things, which in turn mean
> that people won't turn it on (and we can't turn it on by default),
> which in turn makes this pointless.
> 
> In other words, I really think that anything that just adds a mode
> flag cannot work.
> 
> So instead of having one "modules_autoload_mode" thing, maybe the
> individual requerst_module() cases need to simply be audited.
> 
> Put another way: I think the part of your patch series that does that
> "request_module_cap()" and makes the netdev modules use it is a good
> addition.
> 
> It's the "mode" part I really don't agree with, because apparently we
> really need to default it to permissive.
> 
> So how about instead:
> 
>  - add that "request_module_cap()" and make the networking code that
> already uses CAP_ADMIN_NET use it.
> 
>  - make "request_module()" itself default to being
> "request_module_cap(CAP_SYS_MODULE,..)"
> 
>  - make sure that when the capability check fails, we print an error
> message, and then for the ones that trigger, we will audit them and
> see if it's ok.
> 
> Because that "mode" flag defaulting to off will just mean that the
> default case will remain the existing unsafe one, and that's bad.
> 
> Opt-in really doesn't work. We've done it.
> 
> Global flags for varied behavior really doesn't work. We've done that
> too. Different cases want different behavior, the global flag is just
> fundamentally broken.
> 
>               Linus
> ----------------------------------------

The only reason this is off by default is kernel rule No 1. Most of new security features are off or WARN_ON because of it otherwise bad things happen on mailinglists :) .

It seems to me that you demand an impossible job here. Don't break userspace and don't disable by default. It can't happen due to years of technical debt. I mean everyone prefers default-on but we can choose best alternative instead of maintaining status quo.

Userspace can be configured in a way which is compatible with those changes being on the same as it can be configured to work with selinux. That means on distro level or sysadmin level it's a valuable tool. It's better than nothing and it's better than using some out-of-tree patches instead. Switching one sysctl would make their life easier.

There is demand for kernel hardening stuff and default-off behavior is good compromise between breaking changes and nothing at all. The kernel has very diverse users group with different needs and capabilities so there is no one size fits all. Targeting all of them will often end in nothing useful to anyone.

Yours sincerely

G. K.
Theodore Ts'o Nov. 28, 2017, 7:32 p.m. UTC | #10
On Tue, Nov 28, 2017 at 01:16:59PM +0100, Geo Kozey wrote:
> 
> Userspace can be configured in a way which is compatible with those
> changes being on the same as it can be configured to work with
> selinux. That means on distro level or sysadmin level it's a
> valuable tool. It's better than nothing and it's better than using
> some out-of-tree patches instead. Switching one sysctl would make
> their life easier.

If *selinux* can opt-in to something more stringent, such that when
you upgrade to a new version of selinux which enables something which
breaks all modules unless you set up the rules corretly, I don't see a
problem with it.  It might force distributions not to go to the latest
version of SELinux because users get cranky when their systems get
broken, but for people like me, who *still* don't use SELinux because
every few years, i try to enable on my development laptop running
Debian, watch ***far*** too much stuff break. and then turn it off
again.  So tieing it to SELinux (as far as I am concerned) reduces it to
a previously unsolved problem.  :-)

But that's different from opting it on by default for non-SELinux
users.  To which I can only say, "Please, No."

	   	       	    	 	  - Ted
Kees Cook Nov. 28, 2017, 8:08 p.m. UTC | #11
On Tue, Nov 28, 2017 at 11:32 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 28, 2017 at 01:16:59PM +0100, Geo Kozey wrote:
>>
>> Userspace can be configured in a way which is compatible with those
>> changes being on the same as it can be configured to work with
>> selinux. That means on distro level or sysadmin level it's a
>> valuable tool. It's better than nothing and it's better than using
>> some out-of-tree patches instead. Switching one sysctl would make
>> their life easier.
>
> If *selinux* can opt-in to something more stringent, such that when
> you upgrade to a new version of selinux which enables something which
> breaks all modules unless you set up the rules corretly, I don't see a
> problem with it.  It might force distributions not to go to the latest
> version of SELinux because users get cranky when their systems get
> broken, but for people like me, who *still* don't use SELinux because
> every few years, i try to enable on my development laptop running
> Debian, watch ***far*** too much stuff break. and then turn it off
> again.  So tieing it to SELinux (as far as I am concerned) reduces it to
> a previously unsolved problem.  :-)
>
> But that's different from opting it on by default for non-SELinux
> users.  To which I can only say, "Please, No."

I don't want to see this tied to SELinux because it narrows the
audience, and SELinux still hasn't solved their issues in containers.
I think the per-task setting is sufficient.

Linus, are you okay with this series if the global sysctl gets dropped?

-Kees
Linus Torvalds Nov. 28, 2017, 8:12 p.m. UTC | #12
On Tue, Nov 28, 2017 at 12:08 PM, Kees Cook <keescook@chromium.org> wrote:
>
> Linus, are you okay with this series if the global sysctl gets dropped?

So really, it's not the "global sysctl" as much as the "global
request_module()" that annoys me.

I'll happily take the request_module_cap() part and the thing that
makes networking use that.

But the flag that we have to default to off because it breaks every
single box otherwise? No. It doesn't  matter if it's one single global
or just a "global behavior for request_module() for this process" at
that point, it's still a pointless security flag that is opt-in.

             Linus
Kees Cook Nov. 28, 2017, 8:20 p.m. UTC | #13
On Tue, Nov 28, 2017 at 12:12 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 28, 2017 at 12:08 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> Linus, are you okay with this series if the global sysctl gets dropped?
>
> So really, it's not the "global sysctl" as much as the "global
> request_module()" that annoys me.
>
> I'll happily take the request_module_cap() part and the thing that
> makes networking use that.
>
> But the flag that we have to default to off because it breaks every
> single box otherwise? No. It doesn't  matter if it's one single global
> or just a "global behavior for request_module() for this process" at
> that point, it's still a pointless security flag that is opt-in.

To be clear: such a flag wouldn't doesn't break every system, but I
understand your concern.

So what's the right path forward for allowing a way to block
autoloading? Separate existing request_module() calls into "must be
privileged" and "can be unpriv" first, then rework the series to deal
with the "unpriv okay" subset?

-Kees
Linus Torvalds Nov. 28, 2017, 8:33 p.m. UTC | #14
On Tue, Nov 28, 2017 at 12:20 PM, Kees Cook <keescook@chromium.org> wrote:
>
> So what's the right path forward for allowing a way to block
> autoloading? Separate existing request_module() calls into "must be
> privileged" and "can be unpriv" first, then rework the series to deal
> with the "unpriv okay" subset?

So once we've taken care of the networking ones that check their own
different capability bit, maybe we can then make the regular
request_module() do a rate-limited warning for non-CAP_SYS_MODULE uses
that prints which module it's loading.

And then just see what people report.

Because maybe it's just a very small handful that matters, and we can
say "those are ok".

And maybe that is too optimistic, and we have a lot of device driver
ones because people still have a static /dev and don't have udev
populating modules and device nodes, and then maybe we need to
introduce a "request_module_dev()" where the rule is that you had to
at least have privileges to open the device node.

Because I really am *not* interested in these security flags that are
off by default and then get turned on by special cases. I think it's
completely unacceptable to say "we're insecure by default but then you
can do X and be secure". It doesn't work. It doesn't fix anything.

                 Linus
Djalal Harouni Nov. 28, 2017, 9:10 p.m. UTC | #15
On Tue, Nov 28, 2017 at 9:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 28, 2017 at 12:20 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> So what's the right path forward for allowing a way to block
>> autoloading? Separate existing request_module() calls into "must be
>> privileged" and "can be unpriv" first, then rework the series to deal
>> with the "unpriv okay" subset?
>
> So once we've taken care of the networking ones that check their own
> different capability bit, maybe we can then make the regular
> request_module() do a rate-limited warning for non-CAP_SYS_MODULE uses
> that prints which module it's loading.

Alright, I can start by those.


> And then just see what people report.
>
> Because maybe it's just a very small handful that matters, and we can
> say "those are ok".

The ones that are in the cover letter, etc may not have the
appropriate context, the request_module_dev() sure can be made since
if you can open you already have the context.


> And maybe that is too optimistic, and we have a lot of device driver
> ones because people still have a static /dev and don't have udev
> populating modules and device nodes, and then maybe we need to
> introduce a "request_module_dev()" where the rule is that you had to
> at least have privileges to open the device node.
>
> Because I really am *not* interested in these security flags that are
> off by default and then get turned on by special cases. I think it's
> completely unacceptable to say "we're insecure by default but then you
> can do X and be secure". It doesn't work. It doesn't fix anything.

this still leaves all the cases where we don't have the appropriate
context and other implicit loads that are triggered by another
implicit load, etc.

Also the simple local flag is easy to grasp, with real users for it,
and we can abstract on top with load "newfeatures" of course this does
not mean that we should say "we-re insecure by default". I want it to
be more allow newfeatures or not for apps and users... requiring caps
may give users the idea to pass CAP_SYS_MODULE or other caps for
something that used to work, they may start giving it if we break lot
of usecases, and yeh the caps are much broader and do much more
harm...

Ok, so beside updating with request_module_cap() I will investigate
request_module_dev() and we can see

Thanks!

>                  Linus
Kees Cook Nov. 28, 2017, 9:33 p.m. UTC | #16
On Tue, Nov 28, 2017 at 12:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 28, 2017 at 12:20 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> So what's the right path forward for allowing a way to block
>> autoloading? Separate existing request_module() calls into "must be
>> privileged" and "can be unpriv" first, then rework the series to deal
>> with the "unpriv okay" subset?
>
> So once we've taken care of the networking ones that check their own
> different capability bit, maybe we can then make the regular
> request_module() do a rate-limited warning for non-CAP_SYS_MODULE uses
> that prints which module it's loading.
>
> And then just see what people report.

I'm fine to try this experiment. I just think it'll be very noisy. :)

> Because maybe it's just a very small handful that matters, and we can
> say "those are ok".

See, that's the problem: it's not a matter of "those are ok". Those
modules will either have bugs discovered in them or will gain bugs
over time. The point is to provide a mechanism to block an entire
attack surface on systems that do not need auto-loading (which is a
lot of systems).

As I've said before, this isn't a theoretical attack surface. This
year alone there have been three known-exploitable flaws exposed by
autoloading:

The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
prefix, and is intentionally unprivileged.

The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
IPPROTO_IP) [2]. This is using the existing proto prefix, and is
intentionally unprivileged.

The exploit for CVE-2017-7184 uses socket(PF_NETLINK, SOCK_RAW,
NETLINK_XFRM) [3]. This is using existing proto prefix and became
unprivileged by way of user namespaces (CAP_NET_ADMIN in a
user-namespace).

Concerned admins have already disabled user namespaces, so they were
protected from the last one, but they couldn't stop the first two
without explicit blacklists ahead of time. (Though anyone paying
attention to vulnerability histories would have already blacklisted
DCCP.)

> And maybe that is too optimistic, and we have a lot of device driver
> ones because people still have a static /dev and don't have udev
> populating modules and device nodes, and then maybe we need to
> introduce a "request_module_dev()" where the rule is that you had to
> at least have privileges to open the device node.
>
> Because I really am *not* interested in these security flags that are
> off by default and then get turned on by special cases. I think it's
> completely unacceptable to say "we're insecure by default but then you
> can do X and be secure". It doesn't work. It doesn't fix anything.

I empathize with this stance, but the last many years of hardening has
been doing this just fine, and we've progressively moved defenses from
"off by default" to "on by default", by way of distros, etc, over
time. But sometimes we have things that must stay off by default to
avoid breaking rule #1, even though every distro has turned it on by
default (e.g. link restrictions). So, saying "it must be on by default
and not break userspace" puts us in a bit of a catch-22.

It sounds like the proposed path forward is:

1) convert existing privileged request_module() usage to use explicit
permission and/or capability checks
2) mark the remaining with WARN_RATELIMIT
3) wait
4) ???

It's steps 3 and 4 that worry me because we have a standing need for a
solution here, and "wait" isn't a very convincing solution and step 4
is unknown. We know we're going to have unprivileged module loading
(e.g. AF_ALG), even if we reduce the places where it's used. In the
end, we will still need a way for a process to say "I don't want to
auto-load modules from here on out". That's the part I want to figure
out, otherwise Djalal doesn't have a way to address the attack
surface.

-Kees

[1] https://github.com/snorez/exploits/blob/master/cve-2017-2636/cve-2017-2636.c
[2] https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-6074/poc.c
[3] https://github.com/snorez/exploits/blob/master/cve-2017-7184/exp.c
Geo Kozey Nov. 28, 2017, 9:51 p.m. UTC | #17
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Tue Nov 28 21:33:22 CET 2017
> To: Kees Cook <keescook@chromium.org>
> Subject: Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

> Because I really am *not* interested in these security flags that are
> off by default and then get turned on by special cases. I think it's
> completely unacceptable to say "we're insecure by default but then you
> can do X and be secure". It doesn't work. It doesn't fix anything.
> 
>                  Linus
> ----------------------------------------

What about "we're insecure by default but you can't do anything to change this"? It describes current situation. For last 20 years linux allowed for insecure behavior and tons of tools were built depending on it. It's recurring theme of kernel security development. I'll be glad if some genius propose perfect idea solving this problem but I'm afraid things go nowhere instead.

Yours sincerely

G. K.
Theodore Ts'o Nov. 28, 2017, 11:23 p.m. UTC | #18
On Tue, Nov 28, 2017 at 01:33:40PM -0800, Kees Cook wrote:
> As I've said before, this isn't a theoretical attack surface. This
> year alone there have been three known-exploitable flaws exposed by
> autoloading:
> 
> The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
> TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
> prefix, and is intentionally unprivileged.
> 
> The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
> IPPROTO_IP) [2]. This is using the existing proto prefix, and is
> intentionally unprivileged.

So in these two cases, if the kernel was built w/o modules, and HDLC
and DCCP was built-in, you'd be screwed, then?

Is the goal here to protect people using distro kernels which build
the world as modules, including dodgy pieces of kernel code that are
bug-ridden?

If so, then presumably 90% of the problem you've cited can be done by
creating a script which takes a look of the modules that are normally
in use once the machine is in production, and then deleting everything
else?  Correct?

And yes, this will potentially break some users, but the security
folks who are advocating for the more aggressive version of this
change seem to be OK with breaking users, so they can do this without
making kernel changes.  Good luck getting Red Hat and SuSE to accept
such a change, though....

						- Ted
Kees Cook Nov. 28, 2017, 11:29 p.m. UTC | #19
On Tue, Nov 28, 2017 at 3:23 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 28, 2017 at 01:33:40PM -0800, Kees Cook wrote:
>> As I've said before, this isn't a theoretical attack surface. This
>> year alone there have been three known-exploitable flaws exposed by
>> autoloading:
>>
>> The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
>> TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
>> prefix, and is intentionally unprivileged.
>>
>> The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
>> IPPROTO_IP) [2]. This is using the existing proto prefix, and is
>> intentionally unprivileged.
>
> So in these two cases, if the kernel was built w/o modules, and HDLC
> and DCCP was built-in, you'd be screwed, then?

Sure, but that's not the common situation.

> Is the goal here to protect people using distro kernels which build
> the world as modules, including dodgy pieces of kernel code that are
> bug-ridden?

The bulk of the risk comes from distro kernels, yes. (Though "bug
ridden" is a strong statement. There are and will be bugs, scattered
across a wide portion of the kernel, it's just that modules tend to
cover most of that attack surface.)

> If so, then presumably 90% of the problem you've cited can be done by
> creating a script which takes a look of the modules that are normally
> in use once the machine is in production, and then deleting everything
> else?  Correct?

This isn't always obvious nor is it easy: a kernel upgrade will
reinstall everything, etc. This is effectively a blacklist, and we
need an easy whitelist approach. And an admin may not want to make it
so hard for themselves to load some special filesystem module that
they have to hunt it down and reinstall it. Or they'll want some
containers to be able to trigger module loading and others not to.
They just want to know their users aren't triggering module loading
that expands the kernel attack surface.

> And yes, this will potentially break some users, but the security
> folks who are advocating for the more aggressive version of this
> change seem to be OK with breaking users, so they can do this without
> making kernel changes.  Good luck getting Red Hat and SuSE to accept
> such a change, though....

We should strive to make the kernel easy to protect. Expecting admins
to jump through these hoops isn't a sensible way forward.

-Kees
Theodore Ts'o Nov. 28, 2017, 11:49 p.m. UTC | #20
On Tue, Nov 28, 2017 at 03:29:01PM -0800, Kees Cook wrote:
> > So in these two cases, if the kernel was built w/o modules, and HDLC
> > and DCCP was built-in, you'd be screwed, then?
> 
> Sure, but that's not the common situation.
> 
> > Is the goal here to protect people using distro kernels which build
> > the world as modules, including dodgy pieces of kernel code that are
> > bug-ridden?
> 
> The bulk of the risk comes from distro kernels, yes. (Though "bug
> ridden" is a strong statement. There are and will be bugs, scattered
> across a wide portion of the kernel, it's just that modules tend to
> cover most of that attack surface.)

OK, but if the goal is to protect users who are running distro
kernels, then a kernel config that breaks some percentage of users is
****highly**** unlikely to be enabled by Red Hat and SuSE, right?  And
many of these users either can't (from a skills perspective) or won't
(because they lose all support from the enterprise distro's help desk)
recompile their kernel to enable an "might break 3% of all users ---
oh, well" config option.

Which argues for being extremely conservative about making something
that has an extremely low probability of breaking users, and it points
out why Geo Kozey's "who cares about breaking users; security is
IMPORTANT" argument is so wrong-headed.

If the goal is to protect distro kernels, but a sloppy approach
guarantees that distro kernels will never enable it, is it really
worth it?

	   	       	       	    	  	 - Ted

P.S.  This is where it might be useful to get some input from the Red
Hat and SuSE support teams.  How many angry user calls to their help
desk are they willing to field before they'll just turn off the kernel
config option for their kernels?
Linus Torvalds Nov. 28, 2017, 11:51 p.m. UTC | #21
On Tue, Nov 28, 2017 at 1:51 PM, Geo Kozey <geokozey@mailfence.com> wrote:
>
> What about "we're insecure by default but you can't do anything to change this"? It describes current situation.

Go away, and don't send me patches until you have dug your head out of
whatever hole you have put it in..

If this is the kind of shit-headed responses I get from the
"hardening" list, then I don't want to have anything to do with you
guys.

Seriously.

I sent out a long explanation of what's wrong with the hardening
people last week. It made the news. If you still don't understand,
you're simply not worth working with.

If you cannot help improve kernel security for the default case, and
you can't even be bothered to try, and only want to fix some special
case that doesn't then improve anything at all for most people, I
really _really_ suggest you go play in your own sandbox.

Because clearly, if you're not interested in improving things for
anybody else, why the hell should you care about the upstream kernel
anyway?

That's what this boils down to: if you send me patches, you had better
strive to improve security for everybody, not just for some little
locked-down special case.

We're not grsecurity. We never have been. We're not interested in the
crazy people. We're interested in the kind of security that is
generally applicable.

To the mainline kernel, not breaking existing users matters, but it
also matters that the patches make sense for everybody, because
otherwise, why be mainline?

So a patch that avoids breaking existing users, but also doesn't
actually improve anything for existing users, simply shouldn't be part
of the mainline kernel.

Comprende?

                    Linus
Djalal Harouni Nov. 28, 2017, 11:53 p.m. UTC | #22
On Wed, Nov 29, 2017 at 12:23 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 28, 2017 at 01:33:40PM -0800, Kees Cook wrote:
>> As I've said before, this isn't a theoretical attack surface. This
>> year alone there have been three known-exploitable flaws exposed by
>> autoloading:
>>
>> The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
>> TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
>> prefix, and is intentionally unprivileged.
>>
>> The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
>> IPPROTO_IP) [2]. This is using the existing proto prefix, and is
>> intentionally unprivileged.
>
> So in these two cases, if the kernel was built w/o modules, and HDLC
> and DCCP was built-in, you'd be screwed, then?
>
> Is the goal here to protect people using distro kernels which build
> the world as modules, including dodgy pieces of kernel code that are
> bug-ridden?
>
> If so, then presumably 90% of the problem you've cited can be done by
> creating a script which takes a look of the modules that are normally
> in use once the machine is in production, and then deleting everything
> else?  Correct?
>
> And yes, this will potentially break some users, but the security
> folks who are advocating for the more aggressive version of this
> change seem to be OK with breaking users, so they can do this without
> making kernel changes.  Good luck getting Red Hat and SuSE to accept
> such a change, though....

The patches does not change default and make it easy for users and we
have request for this, not all world is Red Hat / SuSE , I build
embedded Linux for clients when I manage to have some, and I clearly
would have set this to my clients since most of them won't be able to
afford all the signing and complexity, now how I should allow modules
load/unload replace with newer versions, but restrict some of their
apps from triggering it, "modules_disabled=1" is not practical. I
can't build a perfect version for every usecase,  and they started to
ship apps for IoT, even containers for IoT, yes they do it and they
use the same os for various use cases! so it is not about those, even
embedded vendors have one single shared layer that they use for all
their products.

For distros, the target is also containers and sandboxes, and they are
already interested in it.

P.S. please the cover letter already mentions that this is for Embedded and IoT.

>                                                 - Ted
Linus Torvalds Nov. 29, 2017, 12:17 a.m. UTC | #23
On Tue, Nov 28, 2017 at 3:51 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So a patch that avoids breaking existing users, but also doesn't
> actually improve anything for existing users, simply shouldn't be part
> of the mainline kernel.

Just to clarify: maybe it ends up being truly impossible to make the
default be more restrictive. The default certainly won't be the most
restrictive option.

But at the same time, if people can't even be bothered to try to
improve the general case, and only do things that you have to opt in
for, it really isn't much of an improvement. We had this whole
"opt-in" discussion for another thread entirely, and it basically
didn't improve anything for anybody for the better half of a decade.

Hardening that only works for special cases isn't hardening at all. It
will just mean that 99+% of all kernel developers won't see the
fallout at all.

Yes, something like Android may be 99% of the devices, but it's a very
small portion of the core developer base because the hardware is all
locked down, and it's an even smaller portion of the usage patterns.

So I can see some people say "We can use this for android and protect
the 99%". But if it then is basically invisible to the rest of the
user base, it means that all those servers etc aren't getting the kind
of protection they should have.

Just to take that DCCP thing as an example: being a module wasn't what
made it vulnerable. It would have been vulnerable compiled in too.
What made it vulnerable was that the DCCP code simply isn't widely
enough used and tested, and basically barely on life support. And it
was available much too widely despite that.

So all this is about is to make for a smaller attack surface.

But if it turns out that we can make the attack surface smaller by
simply white-listing a few modules that we know are actively used and
feel better about the quality of, that makes for a much smaller attack
surface _too_. And it does so in general, without having to set some
flag that 99% of all MIS people won't even really know about.

So that's why I want people to look at a different approach. Yes, the
opt-in model means that by default nothing changes. That protects
against the whole "oops, we don't break user space". But it has a huge
downside.

The model that I am a proponent of is to take a softer approach
initially: don't forbid module loading (because that breaks users),
but instead _warn_ about non-root module loading. And then we can
start fixing the cases that we find.

See? This is *exactly* the same thing that the user-mode access thing
was about. Hardening people need to get over their "hard rules"
mindset. We don't kill processes or forbid them from doing things that
might be bad. We start by warning about them, to see what "might be
bad" cases are actually normal, and not actually bad at all. And then
we use that information to guide our notion of what should actually
trigger a stronger response.

                   Linus
Kees Cook Nov. 29, 2017, 12:18 a.m. UTC | #24
On Tue, Nov 28, 2017 at 3:49 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> OK, but if the goal is to protect users who are running distro
> kernels, then a kernel config that breaks some percentage of users is
> ****highly**** unlikely to be enabled by Red Hat and SuSE, right?  And
> many of these users either can't (from a skills perspective) or won't
> (because they lose all support from the enterprise distro's help desk)
> recompile their kernel to enable an "might break 3% of all users ---
> oh, well" config option.

There's a spectrum across "insane not to be done everywhere"
(STRICT_KERNEL_RWX), "this is a good idea for nearly all Linux
systems" (link restrictions), "this can break some common use-cases,
but protects systems without that use-case" (user-namespace
disabling), "this breaks most systems, but specialized deployments
really need it" (modules_disabled).

There's also a difference between immutable CONFIG options that cannot
be disabled at runtime, those that can, global sysctls, per-namespace
controls, etc etc. The kernel is all about providing admins with knobs
to tweak their performance and security. Suddenly being told that we
can't create optional improvements is very odd.

Now, being told "make it easier to audit all the module loading we're
already doing so we can further reduce needless exposures for everyone
even without this feature enabled", THAT makes sense. My point there
is that we've already done those kinds of things; see commits like:

  7f78e0351394 ("fs: Limit sys_mount to only request filesystem modules.")
  5d26a105b5a7 ("crypto: prefix module autoloading with "crypto-"")
  4943ba16bbc2 ("crypto: include crypto- module prefix in template")

> Which argues for being extremely conservative about making something
> that has an extremely low probability of breaking users, and it points
> out why Geo Kozey's "who cares about breaking users; security is
> IMPORTANT" argument is so wrong-headed.

I don't want to break users either. I want to provide meaningful ways
for admins to reduce their kernel attack surface. Djalal and I aren't
advocating for on-by-default removal of module auto-loading (that
would have been a very small patch!). The idea was to provide a
dynamic control over it, and make that available to distros. As shown
in the patch series, it would be immediately put to use with systemd
for process-tree isolation and for container restriction.

> If the goal is to protect distro kernels, but a sloppy approach
> guarantees that distro kernels will never enable it, is it really
> worth it?

I don't think this is sloppy and of course distros would see use,
since systemd would be using it.

> P.S.  This is where it might be useful to get some input from the Red
> Hat and SuSE support teams.  How many angry user calls to their help
> desk are they willing to field before they'll just turn off the kernel
> config option for their kernels?

This isn't about giant-hammer CONFIGs -- this is like more like
PR_SET_DUMPABLE or seccomp.

-Kees
Kees Cook Nov. 29, 2017, 12:26 a.m. UTC | #25
On Tue, Nov 28, 2017 at 4:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> But at the same time, if people can't even be bothered to try to
> improve the general case, and only do things that you have to opt in
> for, it really isn't much of an improvement. We had this whole

Right, and I'm fine to help with those general improvements. As I
mentioned in another reply, we've been steadily narrowing the
exposures by keeping request_module() from being exposed to "all
possible modules" in many of the places it's been exposed to
userspace:

  7f78e0351394 ("fs: Limit sys_mount to only request filesystem modules.")
  5d26a105b5a7 ("crypto: prefix module autoloading with "crypto-"")
  4943ba16bbc2 ("crypto: include crypto- module prefix in template")

I don't want you to think this is an area where work hasn't already
been done to try to improve things.

> The model that I am a proponent of is to take a softer approach
> initially: don't forbid module loading (because that breaks users),
> but instead _warn_ about non-root module loading. And then we can
> start fixing the cases that we find.

I am totally fine with this. The question I'm hoping to have answered
is, "then what?" We already have concrete examples of module
autoloading that will still be need to stay unprivileged and as-is in
the kernel (even if we remove others). What do you see as the way to
allow an admin to turn those off?

-Kees
Linus Torvalds Nov. 29, 2017, 12:50 a.m. UTC | #26
On Tue, Nov 28, 2017 at 4:26 PM, Kees Cook <keescook@chromium.org> wrote:
>
>> The model that I am a proponent of is to take a softer approach
>> initially: don't forbid module loading (because that breaks users),
>> but instead _warn_ about non-root module loading. And then we can
>> start fixing the cases that we find.
>
> I am totally fine with this. The question I'm hoping to have answered
> is, "then what?" We already have concrete examples of module
> autoloading that will still be need to stay unprivileged and as-is in
> the kernel (even if we remove others). What do you see as the way to
> allow an admin to turn those off?

Just thinking about the DCCP case, where networking people actually
knew it was pretty deprecated and had no real maintainer, I think one
thing to look at would be simply a per-module flag.

That kind of thing should be fairly easy to implement, along the same
lines as the module license - it just sets a flag in the ELF section
headers.

With something like that, we literally could make the default be "no
autoloading except for root", and then just mark the modules that we
think are ok and well maintained.

Sure, if you then do a lock-down mode that makes that flag parsing
stricter, then that's a separate thing. But I suspect we definitely
could be a lot stricter on a per-module basis, and do it in a way
where a normal user wouldn't even notice that we've limited the
autoloading.

But the first step would be to just add some noise. And even with the
per-module flag, at first it would only suppress the noise (ie we'd
still _allow_ loading other modules, they'd just be noisy). Then, if
nobody hollers, maybe the next kernel release we'll make it actually
enforce the flag.

Does that sound reasonable?

               Linus
Eric W. Biederman Nov. 29, 2017, 4:26 a.m. UTC | #27
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Nov 28, 2017 at 4:26 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>>> The model that I am a proponent of is to take a softer approach
>>> initially: don't forbid module loading (because that breaks users),
>>> but instead _warn_ about non-root module loading. And then we can
>>> start fixing the cases that we find.
>>
>> I am totally fine with this. The question I'm hoping to have answered
>> is, "then what?" We already have concrete examples of module
>> autoloading that will still be need to stay unprivileged and as-is in
>> the kernel (even if we remove others). What do you see as the way to
>> allow an admin to turn those off?
>
> Just thinking about the DCCP case, where networking people actually
> knew it was pretty deprecated and had no real maintainer, I think one
> thing to look at would be simply a per-module flag.
>
> That kind of thing should be fairly easy to implement, along the same
> lines as the module license - it just sets a flag in the ELF section
> headers.
>
> With something like that, we literally could make the default be "no
> autoloading except for root", and then just mark the modules that we
> think are ok and well maintained.
>
> Sure, if you then do a lock-down mode that makes that flag parsing
> stricter, then that's a separate thing. But I suspect we definitely
> could be a lot stricter on a per-module basis, and do it in a way
> where a normal user wouldn't even notice that we've limited the
> autoloading.
>
> But the first step would be to just add some noise. And even with the
> per-module flag, at first it would only suppress the noise (ie we'd
> still _allow_ loading other modules, they'd just be noisy). Then, if
> nobody hollers, maybe the next kernel release we'll make it actually
> enforce the flag.
>

On a slight tangent to all of this.

The issue of reducing attack surface has also come up in another thread
and it was suggested there that we make some ns_capable calls
conditionally capable calls so certain pieces of code won't be available
in user namespaces, when we know there is a bug but don't yet have a
good fix rolled out yet.

Which again brings us to attack surface reduction.

I was wondering if perhaps a better way to do some of that, might be to
have places in the kernel where we could use something like ftrace to
add a permission check at a well known functions boundaries and fail the
functions when we want to reduce the attack surface.

It is not as elegant as adding a maintenance status to a module and only
allowing actively maintained modules to be auto-loaded.  But perhaps
with a few more eyes the idea can be fleshed out to something generally useful.

Eric
Theodore Ts'o Nov. 29, 2017, 6:36 a.m. UTC | #28
On Tue, Nov 28, 2017 at 04:18:59PM -0800, Kees Cook wrote:
> There's also a difference between immutable CONFIG options that cannot
> be disabled at runtime, those that can, global sysctls, per-namespace
> controls, etc etc. The kernel is all about providing admins with knobs
> to tweak their performance and security. Suddenly being told that we
> can't create optional improvements is very odd.

I just think that tweakable knobs are mostly pointless.  From my
experience the number of sysadmins that adjust knobs is ***tiny***[1].
Put another way, the effort to determine whether tweaking a knob will
result in breakages or will be safe is as much work as creating a
white list of modules that are allowed to be loaded.

[1] And I say that having providing a lot of knobs for ext4.  :-)

This is why some on the kernel-hardening list have argued for making
the default to be opt-out, which means some users will be breaken (and
their answer to that seems to be, "oh well --- gotta break some eggs
to make an omlette".  Sucks if you're one of the eggs, though.)

And I don't see how systemd magically means no one will be broken.  If
you have a non-root process trying to invoke a line discpline which
has to be loaded as a module, if you flip the switch, that process
will be broken.  How does using systemd make the problem go away?

					- Ted
Geo Kozey Nov. 29, 2017, 2:46 p.m. UTC | #29
> From: Theodore Ts'o <tytso@mit.edu>
> Sent: Wed Nov 29 00:49:20 CET 2017
> Subject: Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules
> 
> OK, but if the goal is to protect users who are running distro
> kernels, then a kernel config that breaks some percentage of users is
> ****highly**** unlikely to be enabled by Red Hat and SuSE, right?  And
> many of these users either can't (from a skills perspective) or won't
> (because they lose all support from the enterprise distro's help desk)
> recompile their kernel to enable an "might break 3% of all users ---
> oh, well" config option.
> 
> Which argues for being extremely conservative about making something
> that has an extremely low probability of breaking users, and it points
> out why Geo Kozey's "who cares about breaking users; security is
> IMPORTANT" argument is so wrong-headed.
> ----------------------------------------

My argument was the opposite - don't break anything by default. Nobody here wanted to break any users. Some of them may choose stricter option because it actually doesn't break anything relevant for them. That's all. It's the same as perf_event_paranoid, kptr_restrict, yama.ptrace_scope sysctls are working currently.

We're talking about runtime, not compile time option. I doubt RH or SUSE prevent users from changing anything at runtime.


G. K.
Geo Kozey Nov. 29, 2017, 3:28 p.m. UTC | #30
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Wed Nov 29 01:17:05 CET 2017
> To: Geo Kozey <geokozey@mailfence.com>
> Subject: Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules
> 
> 
> On Tue, Nov 28, 2017 at 3:51 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So a patch that avoids breaking existing users, but also doesn't
> > actually improve anything for existing users, simply shouldn't be part
> > of the mainline kernel.
> 
> Just to clarify: maybe it ends up being truly impossible to make the
> default be more restrictive. The default certainly won't be the most
> restrictive option.
> 
> But at the same time, if people can't even be bothered to try to
> improve the general case, and only do things that you have to opt in
> for, it really isn't much of an improvement. We had this whole
> "opt-in" discussion for another thread entirely, and it basically
> didn't improve anything for anybody for the better half of a decade.
> 
> Hardening that only works for special cases isn't hardening at all. It
> will just mean that 99+% of all kernel developers won't see the
> fallout at all.
> 
> Yes, something like Android may be 99% of the devices, but it's a very
> small portion of the core developer base because the hardware is all
> locked down, and it's an even smaller portion of the usage patterns.
> 
> So I can see some people say "We can use this for android and protect
> the 99%". But if it then is basically invisible to the rest of the
> user base, it means that all those servers etc aren't getting the kind
> of protection they should have.
> 
> Just to take that DCCP thing as an example: being a module wasn't what
> made it vulnerable. It would have been vulnerable compiled in too.
> What made it vulnerable was that the DCCP code simply isn't widely
> enough used and tested, and basically barely on life support. And it
> was available much too widely despite that.
> 
> So all this is about is to make for a smaller attack surface.
> 
> But if it turns out that we can make the attack surface smaller by
> simply white-listing a few modules that we know are actively used and
> feel better about the quality of, that makes for a much smaller attack
> surface _too_. And it does so in general, without having to set some
> flag that 99% of all MIS people won't even really know about.
> 
> So that's why I want people to look at a different approach. Yes, the
> opt-in model means that by default nothing changes. That protects
> against the whole "oops, we don't break user space". But it has a huge
> downside.
> 
> The model that I am a proponent of is to take a softer approach
> initially: don't forbid module loading (because that breaks users),
> but instead _warn_ about non-root module loading. And then we can
> start fixing the cases that we find.
> 
> See? This is *exactly* the same thing that the user-mode access thing
> was about. Hardening people need to get over their "hard rules"
> mindset. We don't kill processes or forbid them from doing things that
> might be bad. We start by warning about them, to see what "might be
> bad" cases are actually normal, and not actually bad at all. And then
> we use that information to guide our notion of what should actually
> trigger a stronger response.
> 
>                    Linus

I followed your last week discussion and I agreed with your arguments so
it's unfortunate for me that You put me on the opposite side. I agree that
improving kernel code by finding and fixing existing bugs is the ultimate
goal for security features.

On the other hand the whole debugging process (find, report, wait for fix)
takes a lot of time and work and is basically never ending story. That's why
some kind of prevention which protects users in-between is important for
users which priorities maximal security on their systems.

I like your idea about printing warnings. What about:

modules_autoload_mode =0 (default)
_warn_ about non-root module loading and allow

modules_autoload_mode =1
_warn_ about non-root module loading and deny

This way we can start fixing found cases and also give users option for proactive
defense at the same time without breaking general public. The end game (after
a lot of fixes) would be to make 0 and 1 mostly indistinguishable from security
perspective. 

Yours sincerely

G. K.
Kees Cook Nov. 29, 2017, 6:30 p.m. UTC | #31
On Tue, Nov 28, 2017 at 4:50 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 28, 2017 at 4:26 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>>> The model that I am a proponent of is to take a softer approach
>>> initially: don't forbid module loading (because that breaks users),
>>> but instead _warn_ about non-root module loading. And then we can
>>> start fixing the cases that we find.
>>
>> I am totally fine with this. The question I'm hoping to have answered
>> is, "then what?" We already have concrete examples of module
>> autoloading that will still be need to stay unprivileged and as-is in
>> the kernel (even if we remove others). What do you see as the way to
>> allow an admin to turn those off?
>
> Just thinking about the DCCP case, where networking people actually
> knew it was pretty deprecated and had no real maintainer, I think one
> thing to look at would be simply a per-module flag.
>
> That kind of thing should be fairly easy to implement, along the same
> lines as the module license - it just sets a flag in the ELF section
> headers.
>
> With something like that, we literally could make the default be "no
> autoloading except for root", and then just mark the modules that we
> think are ok and well maintained.
>
> Sure, if you then do a lock-down mode that makes that flag parsing
> stricter, then that's a separate thing. But I suspect we definitely
> could be a lot stricter on a per-module basis, and do it in a way
> where a normal user wouldn't even notice that we've limited the
> autoloading.
>
> But the first step would be to just add some noise. And even with the
> per-module flag, at first it would only suppress the noise (ie we'd
> still _allow_ loading other modules, they'd just be noisy). Then, if
> nobody hollers, maybe the next kernel release we'll make it actually
> enforce the flag.
>
> Does that sound reasonable?

Yeah, I think I see the way forward here; thanks for the discussion!

-Kees
Linus Torvalds Nov. 29, 2017, 6:46 p.m. UTC | #32
On Wed, Nov 29, 2017 at 10:30 AM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Nov 28, 2017 at 4:50 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Just thinking about the DCCP case, where networking people actually
>> knew it was pretty deprecated and had no real maintainer, I think one
>> thing to look at would be simply a per-module flag.
>>   [ ... ]
>> Does that sound reasonable?
>
> Yeah, I think I see the way forward here; thanks for the discussion!

Note: I don't want to really force that per-module flag if it ends up
being painful, I was really just thinking that considering the DCCP
case, it would be (I think) a really nice solution.

In particular, request_module() ends up having that indirection
through usermode-helper, which makes it potentially very inconvenient
to store the "did the original caller have proper capabilities" and
then check it at actual module load time.

So the module flag is technically easy to add, and it's technically
easy to read at module loading time, but I suspect that it's actually
annoyingly hard to pass the original request_module() capability
information around to where we actually read the module.

That's why it might be _much_ easier to try to do it per call-site
instead. It's not quite as fine-grained (because several call-sites do
things like

        request_module("net-pf-%d-proto-%d", PF_NETLINK, protocol);

that can load a large number of different modules), but if we can get
away with just saying "this particular callsite is ok, because it only
loads the bluetooth module that we thing is trustworthy" or "this
call-site is ok, because you already had to have access to the device
node", that is going to be much simpler and more straightforward.

In other words: I think there are at least two different models to go
after, and there may be practical reasons to prefer one over the
other.

                    Linus
Linus Torvalds Nov. 29, 2017, 6:53 p.m. UTC | #33
On Wed, Nov 29, 2017 at 10:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the module flag is technically easy to add, and it's technically
> easy to read at module loading time, but I suspect that it's actually
> annoyingly hard to pass the original request_module() capability
> information around to where we actually read the module.

One possibly interesting approach would be to run the usermode helper
not as root, but with the credentials of the request_module() caller.

That's arguably the right thing to do (in that request_module() would
never do anything that the user wouldn't be able to do on their own)
and probably what we should have done originally, but while it feels
like a nice solution I suspect it would break pretty much every distro
out there.

Because they all expect modprobe/kmod to be called as root in the
original init-namespace.

                 Linus
Kees Cook Nov. 29, 2017, 9:17 p.m. UTC | #34
On Wed, Nov 29, 2017 at 10:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 29, 2017 at 10:30 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Nov 28, 2017 at 4:50 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Just thinking about the DCCP case, where networking people actually
>>> knew it was pretty deprecated and had no real maintainer, I think one
>>> thing to look at would be simply a per-module flag.
>>>   [ ... ]
>>> Does that sound reasonable?
>>
>> Yeah, I think I see the way forward here; thanks for the discussion!
>
> Note: I don't want to really force that per-module flag if it ends up
> being painful, I was really just thinking that considering the DCCP
> case, it would be (I think) a really nice solution.
>
> In particular, request_module() ends up having that indirection
> through usermode-helper, which makes it potentially very inconvenient
> to store the "did the original caller have proper capabilities" and
> then check it at actual module load time.

Yeah, for this it seems like the permission _evaluation_ would happen
at request_module() time, and it could pass that as a module argument
to the usermode-helper, so that during load_module(), it would have
all the needed information to take an action:

- was this load considered privileged?
- is this module marked privileged-only?
- what are we configured to do in the case of mismatches?

> So the module flag is technically easy to add, and it's technically
> easy to read at module loading time, but I suspect that it's actually
> annoyingly hard to pass the original request_module() capability
> information around to where we actually read the module.

The clever solution grsecurity uses for this is to pass information as
module loading arguments (mentioned above) in call_modprobe(), and
then read them back in load_module(). This seemed like overkill to me,
but it does allow for delaying the decision on the action.

I think this would only be needed to deal with per-module markings,
though, since all other cases the evaluation and action could happen
at request_module() time. (More on this below.)

> That's why it might be _much_ easier to try to do it per call-site
> instead. It's not quite as fine-grained (because several call-sites do
> things like
>
>         request_module("net-pf-%d-proto-%d", PF_NETLINK, protocol);
>
> that can load a large number of different modules), but if we can get
> away with just saying "this particular callsite is ok, because it only
> loads the bluetooth module that we thing is trustworthy" or "this
> call-site is ok, because you already had to have access to the device
> node", that is going to be much simpler and more straightforward.

This is basically what we've historically already tried to do: limit
request_module() calls to either one or a set of const strings, or to
load with a format template. Removing all the
request_module(whatever_the_user_says) calls has (hopefully) been
completed, though sometimes weird stuff sneaks by.

The difficulty I see here is that it's not always clear which are
expected to be privileged loads or not. Even an analysis[1] of whether
or not request_module() is using a format template doesn't tell us
much. Is it an unprivileged net module like your example above or is
it some privileged device hotplug event like
request_module("ulpi:v%04xp%04x", ulpi->id.vendor, ulpi->id.product)?

> In other words: I think there are at least two different models to go
> after, and there may be practical reasons to prefer one over the
> other.

I would agree, and it's why I thought the original series was good: I
feel like we've already done the harder work of narrowing
request_module() calls by subsystem prefixes over the years, and
separating privilege with a new API didn't make sense to me. The
privileges are a state to check in the existing API, etc. But as I
said, if there's a way forward you're happy with, let's do it. It just
needs to get to the point where an admin can explicitly disable
unprivileged module autoloading with some appropriate level of
granularity.

[email merging your later reply...]
> One possibly interesting approach would be to run the usermode helper
> not as root, but with the credentials of the request_module() caller.
>
> That's arguably the right thing to do (in that request_module() would
> never do anything that the user wouldn't be able to do on their own)
> and probably what we should have done originally, but while it feels
> like a nice solution I suspect it would break pretty much every distro
> out there.
>
> Because they all expect modprobe/kmod to be called as root in the
> original init-namespace.

Dropping to the request_module() privileges for the usermode helper
would absolutely break the world. It would actually be much more
disruptive than the proposed series. :)

One of the (many) problems with module loading is that it changes the
global features available to the kernel: suddenly all existing
processes have access to whatever feature just got loaded (modulo a
syscall path to touch the code). This is a rather ugly from the
perspective of containers: unprivileged module loading can change the
kernel globally for all containers, the load gets triggered to run in
the init userspace, not the container, etc, etc. (Though loading from
the container would be nonsense too.)

So, what we have now is that the permission verification already
happens at and around the existing request_module() calls. Callers are
already checking caps within containers, for example, or isolating
which things can be loaded with subsystem format prefixes, etc. All
the loading logic is being done around the request_module() calls, and
once this is deemed "safe", we explicitly elevate privileges to
init-ns CAP_SYS_MODULE and load the module by way of the usermode
helper's call to finit_module(). As in, we have an intentional
CAP_SYS_MODULE privilege elevation for doing module loading.

What I like about the proposed request_modue_cap() is that is codifies
the specific privilege change "having this cap allows you to gain
CAP_SYS_MODULE to load this possible prefixed subset of modules",
instead of having that logic open-coded near the request_module() call
site. And we can add more of these explicit kinds of calls, but it
won't change the need for a way to say "no privilege elevations should
be allowed".

It still sounds like you'd like to see an explicit change, similar to
the proposed request_module_cap(), that identifies the privilege
expectations on a per-call-site basis. How about this plan:

1) Add request_module_cap(required_cap, module_name_prefix, fmt, fmt_args...)

2) Convert known privileged-but-not-CAP_SYS_MODULE request_module()
callers to request_module_cap(the_needed_cap, prefix, ...)

2) Convert known unprivileged callers to use request_module_cap(0, ...)

3) Add WARN_RATELIMIT for request_module() calls without
CAP_SYS_MODULE to shake out other places where request_module_cap() is
needed.

4) Adapt the original patch series to add the per-process flag that
can block privilege elevations.

This will get us the improvement of removing any unexpectedly
unprivileged request_module() call sites for everyone, and still
provide the optional process flag to mark a process as not allowed to
make change its effective privilege to load a module.

Does this sound workable?

-Kees


[1] Just some greps to see the extent of request_module() use...

We've got a bit over 400 call sites:

$ git grep request_module'\b' | wc -l
438

About a third aren't using a format string, so they're loading either
a specific module or a specific alias:

$ git grep request_module'\b' | grep -v % | grep '"' | wc -l
160

Another third use format strings to load some aliased subset:

$ git grep request_module'\b' | grep % | wc -l
154

The rest construct their strings manually, making them less easy to
trivially examine.
Linus Torvalds Nov. 29, 2017, 10:14 p.m. UTC | #35
On Wed, Nov 29, 2017 at 1:17 PM, Kees Cook <keescook@chromium.org> wrote:
>
> So, what we have now is that the permission verification already
> happens at and around the existing request_module() calls.

Usually, yes.

I liked the "request_module_cap()" interface partly because that made
the net/core/dev_ioctl.c ones more explicit, and maybe it could be
convenient if we make other places do similar things.

I was hoping some other users could be converted, but grepping around,
there's no obvious cases. There is tcp_cong.c and tcp_ulp.c, but they
want some extra locking in between the checking..

> It still sounds like you'd like to see an explicit change, similar to
> the proposed request_module_cap(), that identifies the privilege
> expectations on a per-call-site basis. How about this plan:

Yes.

I'd be perfectly happy to have a long-range plan where the existing
"request_module()" ends up requiring more capabilities.

I just don't think it's a good first step, exactly because *if* it's a
first step, it basically has to be disabled by default.

And once you disable it by default, and it becomes purely opt-in, that
means that nothing will change for most cases. Some embedded people
that do their own thing (ie Android) might change, but normal
distributions probably won't.

Yes, Android may be 99% of the users, and yes, the embedded world in
general needs to be secure, but I'd still like this to be something
that helps _everybody_.

So:

> 1) Add request_module_cap(required_cap, module_name_prefix, fmt, fmt_args...)
>
> 2) Convert known privileged-but-not-CAP_SYS_MODULE request_module()
> callers to request_module_cap(the_needed_cap, prefix, ...)

Yes. The upside seems to be very limited here, but at least it makes
the users that use CAP_NET_ADMIN instead of CAP_SYS_MODULE able to
specify so.

> 2) Convert known unprivileged callers to use request_module_cap(0, ...)

0 is CAP_CHOWN, so it would have to be -1.

And I wouldn't actually want to see that as-is. Not only would I not
want to see people have that "-1" in random driver subsystems, I'd
much prefer to have actual helper naming that descibes why something
is ok

Because as mentioned, I think there are valid permission reasons that
are _not_ about capabilities that make you able to load a module.

If you can open a character device node, then "misc_open()" will do

                request_module("char-major-%d-%d", MISC_MAJOR, minor);

and there is nothing about capabilities in the CAP_SYS_MODULE sense
about the user. But the user _did_ have the privileges to open that
character device file.

That's why I suggested something like request_module_dev(): it's not
at all the same thing as request_module_cap(-1, ...), saying "I don't
need/have a capability". It's saying something else entirely, it's
basically saying "I have the right based on device permissions".

And something like request_module_dev() might even have real semantic
meaning, exactly because it says "this module request comes from
trying to open a device node".

Why would that be? If we know we're on a system where /dev is
auto-populated through udev, then the device nodes should have been
created by the drivers, not the other way around. So we might even
have a rule that notices that automatically, and simply disables
request_module_dev() entirely.

Anyway, I'm not saying that is necessarily something we should do, but
I do suspect that we could adapt to modern systems without having to
have tons of magic settings, and try to be as strict as possible
without breaking them.

Because I dislike "system tuning" in general. I hate knobs that do
kernel performance tuning - we try very hard to just DTRT wrt sizing
hashes etc instead of expecting the system admin to set flags.

And I think we can try to avoid some system tuning in this area too.

I suspect that for a lot of our existing request_module() cases, they
really are pretty trivial. In most cases, it's probably really about
whether you have the hardware or not.

So for the hardware driver cases, either the hardware enumerates
itself, or it is presumably set up by the system scripts anyway, and
CAP_SYS_MODULE is all fine. The "open device node" case is one special
case, though.

That mainly leaves the protocol ones we need to look out for, I suspect.

> 3) Add WARN_RATELIMIT for request_module() calls without
> CAP_SYS_MODULE to shake out other places where request_module_cap() is
> needed.

Yes.

And this is where I hope that there really aren't actually all that
many cases that will warn, and that it's hopefully easy to simply just
look at a handful of reports and say "ok, that case is obviously
fine".

And I may be wrong.

> 4) Adapt the original patch series to add the per-process flag that
> can block privilege elevations.

Yes.

            Linus
Kees Cook Nov. 30, 2017, 12:44 a.m. UTC | #36
On Wed, Nov 29, 2017 at 2:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 29, 2017 at 1:17 PM, Kees Cook <keescook@chromium.org> wrote:
>> 1) Add request_module_cap(required_cap, module_name_prefix, fmt, fmt_args...)
>>
>> 2) Convert known privileged-but-not-CAP_SYS_MODULE request_module()
>> callers to request_module_cap(the_needed_cap, prefix, ...)
>
> Yes. The upside seems to be very limited here, but at least it makes
> the users that use CAP_NET_ADMIN instead of CAP_SYS_MODULE able to
> specify so.

Yeah, agreed; it's limited in scope.

>> 2) Convert known unprivileged callers to use request_module_cap(0, ...)
>
> 0 is CAP_CHOWN, so it would have to be -1.

Oops, yes, think-o.

> And I wouldn't actually want to see that as-is. Not only would I not
> want to see people have that "-1" in random driver subsystems, I'd
> much prefer to have actual helper naming that descibes why something
> is ok

So, if some catch-all is needed, maybe request_module_unpriv(...), but
I think it should be possible to identify SOME kind of defining
privilege to check.

> Because as mentioned, I think there are valid permission reasons that
> are _not_ about capabilities that make you able to load a module.
>
> If you can open a character device node, then "misc_open()" will do
>
>                 request_module("char-major-%d-%d", MISC_MAJOR, minor);
>
> and there is nothing about capabilities in the CAP_SYS_MODULE sense
> about the user. But the user _did_ have the privileges to open that
> character device file.
>
> That's why I suggested something like request_module_dev(): it's not
> at all the same thing as request_module_cap(-1, ...), saying "I don't
> need/have a capability". It's saying something else entirely, it's
> basically saying "I have the right based on device permissions".
>
> And something like request_module_dev() might even have real semantic
> meaning, exactly because it says "this module request comes from
> trying to open a device node".

I just want to make sure I'm visualizing this correctly. Using the
misc_open() example, you're thinking something like:

    request_module_dev(file, "char-major-%d-%d", MISC_MAJOR, minor);

which would then do a verification that file->f_cred matched
current_cred()? (For misc_open(), this is obviously going to always be
okay, but other cases may be further from the fops open handler...)

> Why would that be? If we know we're on a system where /dev is
> auto-populated through udev, then the device nodes should have been
> created by the drivers, not the other way around. So we might even
> have a rule that notices that automatically, and simply disables
> request_module_dev() entirely.

Right, we have both directions -- hotplug module loading ("I saw this
PCI alias, load something! Oh! It needs some fancy crypto too!") and
on-use module loading. It seems like on-use module loading would get
covered by request_module_dev(), and the hotplug case would always be
privileged. Anyway... exploring this with real code is clearly
warranted.

> I suspect that for a lot of our existing request_module() cases, they
> really are pretty trivial. In most cases, it's probably really about
> whether you have the hardware or not.

Right, and those would all be privileged, etc.

> So for the hardware driver cases, either the hardware enumerates
> itself, or it is presumably set up by the system scripts anyway, and
> CAP_SYS_MODULE is all fine. The "open device node" case is one special
> case, though.

Yeah, going through the call sites and asking "where does the
privilege for loading this module derive from?" for each one will help
shape the resulting API changes.

> That mainly leaves the protocol ones we need to look out for, I suspect.

This is where a lot of the exposure really comes from. socket()
triggers a bunch of stuff, but doesn't have an obvious privilege
associated with it... while it already does the name templates, maybe
add request_module_socket() just to explicitly mark it?

>> 3) Add WARN_RATELIMIT for request_module() calls without
>> CAP_SYS_MODULE to shake out other places where request_module_cap() is
>> needed.
>
> Yes.
>
> And this is where I hope that there really aren't actually all that
> many cases that will warn, and that it's hopefully easy to simply just
> look at a handful of reports and say "ok, that case is obviously
> fine".
>
> And I may be wrong.
>
>> 4) Adapt the original patch series to add the per-process flag that
>> can block privilege elevations.
>
> Yes.

Okay, excellent. Hopefully we haven't scared off Djalal, and hopefully
Jessica doesn't think this is all insane. :)

Thanks!

-Kees
Linus Torvalds Nov. 30, 2017, 2:08 a.m. UTC | #37
On Wed, Nov 29, 2017 at 4:44 PM, Kees Cook <keescook@chromium.org> wrote:
>
>> That mainly leaves the protocol ones we need to look out for, I suspect.
>
> This is where a lot of the exposure really comes from. socket()
> triggers a bunch of stuff, but doesn't have an obvious privilege
> associated with it... while it already does the name templates, maybe
> add request_module_socket() just to explicitly mark it?

.. and this is where I'd expect that maybe we'd need some hackery.

Even including some ad-hoc rules like "this module is actually
maintained", possibly even with some /sys interface to extend/reduce
that set.

But maybe it's not even that bad.

                Linus
Daniel Micay Nov. 30, 2017, 6:51 a.m. UTC | #38
> And once you disable it by default, and it becomes purely opt-in, that
> means that nothing will change for most cases. Some embedded people
> that do their own thing (ie Android) might change, but normal
> distributions probably won't.
> 
> Yes, Android may be 99% of the users, and yes, the embedded world in
> general needs to be secure, but I'd still like this to be something
> that helps _everybody_.

Android devices won't get much benefit since they ship a tiny set of
modules chosen for the device. The kernels already get very stripped
down to the bare minimum vs. enabling every feature and driver available
and shipping it all by default on a traditional distribution.

Lots of potential module attack surface also gets eliminated by default
via their SELinux whitelists for /dev, /sys, /proc, debugfs, ioctl
commands, etc. The global seccomp whitelist might be relevant in some
cases too.

Android devices like to build everything into the kernel too, so even if
they weren't using a module this feature wouldn't usually help them. It
would need to work like this existing sysctl:

    net.ipv4.tcp_available_congestion_control = cubic reno lp

i.e. whitelists for functionality offered by the modules, not just
whether they can be loaded.
Djalal Harouni Nov. 30, 2017, 8:50 a.m. UTC | #39
On Thu, Nov 30, 2017 at 7:51 AM, Daniel Micay <danielmicay@gmail.com> wrote:
[...]
> Lots of potential module attack surface also gets eliminated by default
> via their SELinux whitelists for /dev, /sys, /proc, debugfs, ioctl
> commands, etc. The global seccomp whitelist might be relevant in some
> cases too.

In embedded systems we can't maintain a SELinux policy, distro man
power hardly manage. We have abstracted seccomp etc, but the kernel
inherited the difficult multiplex things, plus all other paths that
trigger this.


> Android devices like to build everything into the kernel too, so even if
> they weren't using a module this feature wouldn't usually help them. It
> would need to work like this existing sysctl:
>
>     net.ipv4.tcp_available_congestion_control = cubic reno lp
>
> i.e. whitelists for functionality offered by the modules, not just
> whether they can be loaded.

Yes, but it is hard to maintain a whitelist policy, the code is hardly
maintained... if you include everything you should have an LSM policy
or something like that, and compiling kernels is expert thing.
Otherwise IMHO the kernel should provide default secure behaviour on
how to load or add new functionality to the running one. From a user
perspective, a switch "yes/no" that a privileged entity will
*understand* and assume is what should be there, and the switch or
flag as discussed here is local to processes, the sysctl will be
removed. IMO it should come from userspace point of view, cause as an
example the sysctl:

net.ipv4.tcp_available_congestion_control = cubic reno lp

Is kernel thing, too technical, userspace developers, admins or
privileged entity will not understand what cubic or reno mean. Doing
the same per functionality directly like this seems to much of a
burden compared to the use case. The kernel maybe can do this to
advance the art of the networking stack and for advanced cases, but in
IMHO a sane default behaviour + an abstracted process/sandbox flag
"yes/no" for most others, userspace developers and humans is what
should be provided and we need the kernel to help here.

It seems that Linus and kees agreed on this direction which allows me
to follow up.

Thanks!
Theodore Ts'o Nov. 30, 2017, 2:16 p.m. UTC | #40
On Thu, Nov 30, 2017 at 09:50:27AM +0100, Djalal Harouni wrote:
> In embedded systems we can't maintain a SELinux policy, distro man
> power hardly manage. We have abstracted seccomp etc, but the kernel
> inherited the difficult multiplex things, plus all other paths that
> trigger this.....

> Yes, but it is hard to maintain a whitelist policy, the code is hardly
> maintained...

So this is the part that scares me to death about IOT, and why I tell
everyone to ***never*** trust an IOT device on their home network, and
***never*** trust it with anything you don't mind splattered all over
the front page of NY Times and RT / Sputnick news.

You're saying that you want to use modules (as opposed to compile
everything tightly down to just what you need for the embedded
system); that the code is "hardly maintained".  And yet we're supposed
to consider it trustworthy?

If that's the case, turning off implicit module loading sounds and
thinking that this will somehow be a magic wand sounds.... crazy.

	      	   		     - Ted
Djalal Harouni Nov. 30, 2017, 2:51 p.m. UTC | #41
On Thu, Nov 30, 2017 at 3:16 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Nov 30, 2017 at 09:50:27AM +0100, Djalal Harouni wrote:
>> In embedded systems we can't maintain a SELinux policy, distro man
>> power hardly manage. We have abstracted seccomp etc, but the kernel
>> inherited the difficult multiplex things, plus all other paths that
>> trigger this.....
>
>> Yes, but it is hard to maintain a whitelist policy, the code is hardly
>> maintained...
>
> So this is the part that scares me to death about IOT, and why I tell
> everyone to ***never*** trust an IOT device on their home network, and
> ***never*** trust it with anything you don't mind splattered all over
> the front page of NY Times and RT / Sputnick news.

Yes.

For your pleasure:
https://techcrunch.com/2017/04/25/brickerbot-is-a-vigilante-worm-that-destroys-insecure-iot-devices/
 bricked million of devices to stupid busybox remote port.
https://en.wikipedia.org/wiki/Mirai_(malware)  an other million bots
used to disturb netflix, twitter and others I don't know the details.
...

> You're saying that you want to use modules (as opposed to compile
> everything tightly down to just what you need for the embedded
> system); that the code is "hardly maintained".  And yet we're supposed
> to consider it trustworthy?

I didn't say that.

> If that's the case, turning off implicit module loading sounds and
> thinking that this will somehow be a magic wand sounds.... crazy.

The product costs decide, web developers, javascript, big data
analysis, electronic engineers all want to use Linux for IoT prototype
and sell in some months, they will get any kernel+userspace add their
value on top and sell. It will be non-sense to think that if a web
developer wants to sell a node.js app as an IoT he has to compile a
kernel and do all the other stuff, they all re-use the same layer the
same config for everything. Requiring for everyone to compile its own
kernel does not make much sense. Default safe behaviour is what we
should do.

Thanks!

>                                      - Ted
Daniel Micay Dec. 1, 2017, 6:39 a.m. UTC | #42
It was suggested that the feature would only be adopted in niches like
Android and I pointed out that it's not really relevant to Android.

It's a waste of time to try convincing me that it's useful elsewhere. I
never said or implied that it wasn't.

On Thu, 2017-11-30 at 09:50 +0100, Djalal Harouni wrote:
> On Thu, Nov 30, 2017 at 7:51 AM, Daniel Micay <danielmicay@gmail.com>
> wrote:
> [...]
> > Lots of potential module attack surface also gets eliminated by
> > default
> > via their SELinux whitelists for /dev, /sys, /proc, debugfs, ioctl
> > commands, etc. The global seccomp whitelist might be relevant in
> > some
> > cases too.
> 
> In embedded systems we can't maintain a SELinux policy, distro man
> power hardly manage. We have abstracted seccomp etc, but the kernel
> inherited the difficult multiplex things, plus all other paths that
> trigger this.

It's cheaper to use an existing system like Android Things where device
makers only need to make their apps and perhaps some userspace hardware
drivers for cases not covered by mainline kernel drivers. I don't think
it makes sense for every device vendor to manage an OS and I seriously
doubt that's how the ecosystem is going to end up as it matures.

> > Android devices like to build everything into the kernel too, so
> > even if
> > they weren't using a module this feature wouldn't usually help them.
> > It
> > would need to work like this existing sysctl:
> > 
> >     net.ipv4.tcp_available_congestion_control = cubic reno lp
> > 
> > i.e. whitelists for functionality offered by the modules, not just
> > whether they can be loaded.
> 
> Yes, but it is hard to maintain a whitelist policy, the code is hardly
> maintained... if you include everything you should have an LSM policy
> or something like that, and compiling kernels is expert thing.

I'm not talking about whitelist vs. blacklist, compiling kernels or
anything like that.

> Otherwise IMHO the kernel should provide default secure behaviour on
> how to load or add new functionality to the running one. From a user
> perspective, a switch "yes/no" that a privileged entity will
> *understand* and assume is what should be there, and the switch or
> flag as discussed here is local to processes, the sysctl will be
> removed. IMO it should come from userspace point of view, cause as an
> example the sysctl:
> 
> net.ipv4.tcp_available_congestion_control = cubic reno lp
> 
> Is kernel thing, too technical, userspace developers, admins or
> privileged entity will not understand what cubic or reno mean.

Congestion control algorithms are being used as an example in terms of
the mechanism being used to control which are available to unprivileged
users. The obscurity of congestion control algorithms is irrelevant.

>  Doing
> the same per functionality directly like this seems to much of a
> burden compared to the use case. The kernel maybe can do this to
> advance the art of the networking stack and for advanced cases, but in
> IMHO a sane default behaviour + an abstracted process/sandbox flag
> "yes/no" for most others, userspace developers and humans is what
> should be provided and we need the kernel to help here.

There are cases where unprivileged module auto-loading is relied upon
like network protocols. Having configuration for which protocols can be
used by unprivileged users is superior to limiting only which ones can
be auto-loaded. That's why I bought up the existing congestion control
knob. It works well in terms of having a whitelist of the sane, widely
used cases with exposing anything obscure requiring configuration. They
happen to be implemented as modules too.

Killing off unprivileged module loading other than a few cases like that
makes sense, and then those can provide similar control with similarly
sane defaults.
Marcus Meissner Dec. 1, 2017, 3:22 p.m. UTC | #43
On Tue, Nov 28, 2017 at 06:49:20PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 28, 2017 at 03:29:01PM -0800, Kees Cook wrote:
> > > So in these two cases, if the kernel was built w/o modules, and HDLC
> > > and DCCP was built-in, you'd be screwed, then?
> > 
> > Sure, but that's not the common situation.
> > 
> > > Is the goal here to protect people using distro kernels which build
> > > the world as modules, including dodgy pieces of kernel code that are
> > > bug-ridden?
> > 
> > The bulk of the risk comes from distro kernels, yes. (Though "bug
> > ridden" is a strong statement. There are and will be bugs, scattered
> > across a wide portion of the kernel, it's just that modules tend to
> > cover most of that attack surface.)
> 
> OK, but if the goal is to protect users who are running distro
> kernels, then a kernel config that breaks some percentage of users is
> ****highly**** unlikely to be enabled by Red Hat and SuSE, right?  And
> many of these users either can't (from a skills perspective) or won't
> (because they lose all support from the enterprise distro's help desk)
> recompile their kernel to enable an "might break 3% of all users ---
> oh, well" config option.

Yes, breaking customers is not seen lightly.

I also (not related to this thread here, more to SLAB hardening et.al)
have a hard time getting performance losses caused by hardening features approved.
 
> Which argues for being extremely conservative about making something
> that has an extremely low probability of breaking users, and it points
> out why Geo Kozey's "who cares about breaking users; security is
> IMPORTANT" argument is so wrong-headed.
> 
> If the goal is to protect distro kernels, but a sloppy approach
> guarantees that distro kernels will never enable it, is it really
> worth it?
> 
> 	   	       	       	    	  	 - Ted
> 
> P.S.  This is where it might be useful to get some input from the Red
> Hat and SuSE support teams.  How many angry user calls to their help
> desk are they willing to field before they'll just turn off the kernel
> config option for their kernels?

Speaking for SUSE ... If something that worked for people before and
it breaks, we do get feedback. If no one used it however, we won't.

For our last major product we went over the network module list and 
disabled some for building. e.g. DCCP is no longer built. We did
not receive any complaints about missing DCCP to my knowledge.

We also seperate our modules into "regular supported" and "unsupported"
in different RPMs. The "unsupported" module packages are not shipped on
the Server product. They were shipped on the desktop as some of the WiFi
drivers were requested by customers but were considered not supportable.

We do review this supportable list between kernel version jumps.

Ciao, Marcus
diff mbox

Patch

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 7e690d0..fdd8560 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -382,8 +382,10 @@  void dev_load(struct net *net, const char *name)
 	rcu_read_unlock();
 
 	no_module = !dev;
+	/* "netdev-%s" modules are allowed if CAP_NET_ADMIN is set */
 	if (no_module && capable(CAP_NET_ADMIN))
-		no_module = request_module("netdev-%s", name);
+		no_module = request_module_cap(CAP_NET_ADMIN, "netdev",
+					       "%s", name);
 	if (no_module && capable(CAP_SYS_MODULE))
 		request_module("%s", name);
 }