mbox series

[GIT,PULL] LSM patches for v6.1

Message ID CAHC9VhShpEVTuogj4h74PxbEeTUNn4odo8SE6GBvb6sGUM0LHw@mail.gmail.com (mailing list archive)
State Accepted
Delegated to: Paul Moore
Headers show
Series [GIT,PULL] LSM patches for v6.1 | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git tags/lsm-pr-20221003

Message

Paul Moore Oct. 3, 2022, 10:37 p.m. UTC
Hi Linus,

Seven patches for the LSM layer and we've got a mix of trivial and
significant patches, the highlights are below.  However, before we get
to that I want to mention that you'll notice a merge conflict with
this pull request, the good news is that it is small and easily
resolved.  The conflict occurs in security/selinux/include/classmap.h
and is due to the io_uring/SELinux patch which went in during the
v6.0-rcX cycle to fix the missing LSM/SELinux access controls for the
io_uring command passthrough.  I'm sure you'll figure out the merge on
your own, but if you are unsure, check that the bottom of the
secclass_map[] array definition looks like this:

  const struct security_class_mapping secclass_map[] = {
    /* ... */
    { "anon_inode",
      { COMMON_FILE_PERMS, NULL } },
    { "io_uring",
      { "override_creds", "sqpoll", "cmd", NULL } },
    { "user_namespace",
      { "create", NULL } },
    { NULL }
 };

On to the highlights, starting with the smaller bits first so they
don't get lost in the discussion of the larger items.

- Remove some redundant NULL pointer checks in the common LSM audit code.

- Ratelimit the lockdown LSM's access denial messages.  With this
change there is a chance that the last visible lockdown message on the
console is outdated/old, but it does help preserve the initial series
of lockdown denials that started the denial message flood and my gut
feeling is that these might be the more valuable messages.

- Open userfaultfds as readonly instead of read/write.  While this
code obviously lives outside the LSM, it does have a noticeable impact
on the LSMs with Ondrej explaining the situation in the commit
description.  It is worth noting that this patch languished on the VFS
list for over a year without any comments (objections or otherwise) so
I took the liberty of pulling it into the LSM tree after giving fair
notice.  It has been in linux-next since the end of August without any
noticeable problems.

- Add a LSM hook for user namespace creation, with implementations for
both the BPF LSM and SELinux.  Even though the changes are fairly
small, this is the bulk of the diffstat as we are also including BPF
LSM selftests for the new hook.  It's also the most contentious of the
changes in this pull request with Eric Biederman NACK'ing the LSM hook
multiple times during its development and discussion upstream.  While
I've never taken NACK's lightly, I'm sending these patches to you
because it is my belief that they are of good quality, satisfy a
long-standing need of users and distros, and are in keeping with the
existing nature of the LSM layer and the Linux Kernel as a whole.  The
patches in this pull request implement a LSM hook for user namespace
creation that allows for a granular approach, configurable at runtime,
which enables both monitoring and control of user namespaces.  The
general consensus has been that this is far preferable to the other
solutions that have been adopted downstream including outright removal
from the kernel, disabling via system wide sysctls, or various other
out-of-tree mechanisms that users have been forced to adopt since we
haven't been able to provide them an upstream solution for their
requests.  Eric has been steadfast in his objections to this LSM hook,
explaining that any restrictions on the user namespace could have
significant impact on userspace.  While there is the possibility of
impacting userspace, it is important to note that this solution only
impacts userspace when it is requested based on the runtime
configuration supplied by the distro/admin/user.  Frederick (the
pathset author), the LSM/security community, and myself have tried to
work with Eric during development of this patchset to find a mutually
acceptable solution, but Eric's approach and unwillingness to engage
in a meaningful way have made this impossible.  I have CC'd Eric
directly on this pull request so he has a chance to provide his side
of the story; there have been no objections outside of Eric's.

For reference, I'm providing links to the last five patchset
iterations in case you want to read the discussions in more detail.  I
suspect you can look just at the v5 discussion to get a sense of the
discussion and the arguments involved.

* v5
https://lore.kernel.org/linux-security-module/20220815162028.926858-1-fred@cloudflare.com

* v4
https://lore.kernel.org/linux-security-module/20220801180146.1157914-1-fred@cloudflare.com

* v3
https://lore.kernel.org/linux-security-module/20220721172808.585539-1-fred@cloudflare.com

* v2
https://lore.kernel.org/linux-security-module/20220707223228.1940249-1-fred@cloudflare.com

* v1
https://lore.kernel.org/linux-security-module/20220621233939.993579-1-fred@cloudflare.com

My hope is that you agree with the opinion that the LSM hook for user
namespace creation is something we should have upstream, but if you
agree with Eric Biederman and think this would be a mistake, let me
know and I'll respin this pull request without the LSM hook.

-Paul

--
The following changes since commit 568035b01cfb107af8d2e4bd2fb9aea22cf5b868:

 Linux 6.0-rc1 (2022-08-14 15:50:18 -0700)

are available in the Git repository at:

 git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git
   tags/lsm-pr-20221003

for you to fetch changes up to 1e7d8bcbe37d3c63babe628443f13f77970dd06b:

 lockdown: ratelimit denial messages (2022-09-14 07:37:50 -0400)

----------------------------------------------------------------
lsm/stable-6.1 PR 20221003

----------------------------------------------------------------
Frederick Lawler (4):
     security, lsm: Introduce security_create_user_ns()
     bpf-lsm: Make bpf_lsm_userns_create() sleepable
     selftests/bpf: Add tests verifying bpf lsm userns_create hook
     selinux: Implement userns_create hook

Nathan Lynch (1):
     lockdown: ratelimit denial messages

Ondrej Mosnacek (1):
     userfaultfd: open userfaultfds with O_RDONLY

Xiu Jianfeng (1):
     lsm: clean up redundant NULL pointer check

fs/userfaultfd.c                                   |   4 +-
include/linux/lsm_hook_defs.h                      |   1 +
include/linux/lsm_hooks.h                          |   4 +
include/linux/security.h                           |   6 ++
kernel/bpf/bpf_lsm.c                               |   1 +
kernel/user_namespace.c                            |   5 +
security/lockdown/lockdown.c                       |   2 +-
security/lsm_audit.c                               |  14 +--
security/security.c                                |   5 +
security/selinux/hooks.c                           |   9 ++
security/selinux/include/classmap.h                |   2 +
.../selftests/bpf/prog_tests/deny_namespace.c      | 102 +++++++++++++++++
.../selftests/bpf/progs/test_deny_namespace.c      |  33 +++++++
13 files changed, 172 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c
create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c

Comments

pr-tracker-bot@kernel.org Oct. 4, 2022, 1:03 a.m. UTC | #1
The pull request you sent on Mon, 3 Oct 2022 18:37:59 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git tags/lsm-pr-20221003

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/26b84401da8458c5cbd6818d5732f7bbb84124a2

Thank you!
Eric W. Biederman Oct. 4, 2022, 8:36 p.m. UTC | #2
Linus,
Please don't pull the user namespace bits of this code.

Paul Moore <paul@paul-moore.com> writes:

> Hi Linus,
>
> - Add a LSM hook for user namespace creation, with implementations for
> both the BPF LSM and SELinux.  Even though the changes are fairly
> small, this is the bulk of the diffstat as we are also including BPF
> LSM selftests for the new hook.  It's also the most contentious of the
> changes in this pull request with Eric Biederman NACK'ing the LSM hook
> multiple times during its development and discussion upstream.  While
> I've never taken NACK's lightly, I'm sending these patches to you
> because it is my belief that they are of good quality, satisfy a
> long-standing need of users and distros, and are in keeping with the
> existing nature of the LSM layer and the Linux Kernel as a whole.  The
> patches in this pull request implement a LSM hook for user namespace
> creation that allows for a granular approach, configurable at runtime,
> which enables both monitoring and control of user namespaces.  The
> general consensus has been that this is far preferable to the other
> solutions that have been adopted downstream including outright removal
> from the kernel, disabling via system wide sysctls, or various other
> out-of-tree mechanisms that users have been forced to adopt since we
> haven't been able to provide them an upstream solution for their
> requests.
>
> Eric has been steadfast in his objections to this LSM hook,
> explaining that any restrictions on the user namespace could have
> significant impact on userspace.  While there is the possibility of
> impacting userspace, it is important to note that this solution only
> impacts userspace when it is requested based on the runtime
> configuration supplied by the distro/admin/user.



> Frederick (the
> pathset author), the LSM/security community, and myself have tried to
> work with Eric during development of this patchset to find a mutually
> acceptable solution, but Eric's approach and unwillingness to engage
> in a meaningful way have made this impossible.  I have CC'd Eric
> directly on this pull request so he has a chance to provide his side
> of the story; there have been no objections outside of Eric's.

Paul I am not unwilling.  I have not had much time, or energy.
It feels like every disease that has been going around has made
it's way into my household in the last couple of months.  Including
my son catching COVID just a little while ago.  Inspite of that
I have taken time to engage but the things I have asked for
have not happened, so I have started nacking the patchset.


I have asked on multiple occasions what the problem that is being solved
with this hook so I could understand the problem, and I have asked that
the problem be documented in the patchset.  Something that has not
happened.


I have told you that from a this violates the design of the user
namespace, and said that because of that this is not something that
should be done casually, and it feels like you have simply brushed aside
the objection.  Instead of doing digging into this and showing that
this change to the design won't enable new problems.



From my limited amount of information I have into what people want to do
this appears to be shooting the messenger (the user namespace), given
that there is practically nothing to the user namespace itself.

This can be seen by considering a version of the linux kernel that does
not support privilege changes on exec.  In such a kernel everything that
is enabled as root in a user namespace would largely be enabled for all
applications all of the time.  As with user namespaces the only
restriction really needed is not to be able to modify a namespace
created by another user.  In such a kernel I strongly suspect the
problem that is trying to be solved by the create_user_ns hook is not
being solved.



There is the issue that Serge Hallyn demonstrated that if the concern is
about exploiting functionality the user namespace makes available
then setns is enough, if the user ever creates a user namespace for
any purpose.  Which makes fine grained limiting of user namespace
creation seem questionable.


There is the issue that you are adding a hook into the code user
namespace for out of tree code (LSM policies).  Code that has not shown
itself to be particularly fixable once a buggy version is deployed.
Examples of long time code challenges because of such out of tree code
include /proc/net pointing to /proc/self/net instead of
/proc/thread-self/net, and the same_thread_group check in
__ptrace_may_access.  When I added that same_thread_group check in
__ptrace_may_access I got the impression the LSM folks were going to fix
their LSMs so that it could be removed.

Usually when adding something for out of tree code, it is appropriate
to have a discussion on linux-api and to document the change.  Something
you have not done.


As a maintainer I am still getting bug reports about software that
attempts to lock down a container without using user namespaces.
I have to tell them the fix is use user namespaces.  I am concerned
that there will be random policies in place that encourage people
to perform less secure work-arounds in their user space applications
to get around whatever security polices will be invoked.  Something I
have not seen addressed.


This whole notion that you are going to randomly deny access to the user
namespace for random reasons and aren't willing to discuss those reasons
makes me supremely uncomfortable, and has me thinking that it will
result in a choice of maintaining user namespaces or breaking userspace
at some point.  I would really like some reassurance that we are not
going in that direction.



There is also the issue of the what happens in practice when you deny
creation of a user namespace.  Something I have asked to be explored,
and to which I have seen no one take the time and do.  Something that
won't happen in linux-next as the userspace polices that trigger the
new behavior are not there.

As far as I can tell there are a couple of scenarios that can
happen if you deny creation of a user namespace to an application.

1. It will decide it does not have what it needs and exit.
2. It will silently fallback on pre-usernamespace code that does
   not use a sandbox.  In my testing this is what chrome appears
   to do.
3. It won't try and create a user namespace so nothing happens.
4. It will be an program searching for a kernel exploit and it will
   just try another way to exploit the system.

Given that some applications will become less secure when denied access
to user namespace creation killing those applications with an
uncatchable signal rather than letting continue silently seems more
appropriate.  I suggested this but the suggestion received no traction
in our discussion.

Now that I have taken the time to verify my suspicions I am going to be
modifying the enforcement of max_user_namespaces to user an uncatchable
signal.  The entire notion of tricking an application to thinking it is
running on an old kernel that does not support user namespaces,
resulting in a silent down-grade of application security does not make
me at all comfortable.


As someone who is going to have to deal with the aftermath of adding
this security hook in the years ahead I would really appreciate that we
think these things through before we merge this and wind up in a very
unpleasant situation with no good choices in the years ahead.

I hope I have made my objections clear enough that they can be
understood this time around.

Eric
Linus Torvalds Oct. 4, 2022, 8:55 p.m. UTC | #3
On Tue, Oct 4, 2022 at 1:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Please don't pull the user namespace bits of this code.

Eric, already done.

And I think you are in denial about how many problems the
user-namespace stuff has caused.

Distros are literally turning it off entirely because the whole "let
users create their own namespace" has *NOT* been a great success.

I personally think it was a mistake. We're stuck with it, but we most
definitely need knobs to manage it that isn't just "enable/disable
USER_NS" in the kernel config.

So this whole "don't do this" approach you have is not acceptable.

99% of all code does NOT WANT the user namespace thing, and it's been
a big new attack surface for the kernel getting things subtly wrong.

I do not understand your "people need to be able to do this with no
controls", when the alternative is to literally turn it off ENTIRELY.

I'm not saying that an LSM is the only place to do it, but I don't
think there have been any better suggestions either.

Put another way: your "no limits are acceptable" is simply not
realistic, and you haven't given any sane alternatives that I am aware
of. No way to say "sure, let trusted system apps create their
namespaces, but not random things".

                Linus
Linus Torvalds Oct. 4, 2022, 9:30 p.m. UTC | #4
On Tue, Oct 4, 2022 at 1:55 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So this whole "don't do this" approach you have is not acceptable.

Side note: if we have a security hook for "create random file", then
the notion that creating a whole new namespace somehow must not have
any security hooks because it's *so* special is just ridiculous.

I also note that right now USER_NS is both "default n" and "if not
sure, say 'n'" in the Kconfig files, even though obviously that ship
has sailed long ago.

So originally it might have been a reasonable expectation to say "only
enable this if you're doing containers in servers", but that clearly
isn't the case any more. So we basically take USER_NS for granted, but
the fact that people might want chrome to use it for sandboxing does
*not* mean that randomly we want any CLONE_NEWNS to just be ok,
regardless of how trusted (or not) the case is.

                     Linus
Eric W. Biederman Oct. 5, 2022, 12:38 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 4, 2022 at 1:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Please don't pull the user namespace bits of this code.
>
> Eric, already done.
>
> And I think you are in denial about how many problems the
> user-namespace stuff has caused.
>
> Distros are literally turning it off entirely because the whole "let
> users create their own namespace" has *NOT* been a great success.
>
> I personally think it was a mistake. We're stuck with it, but we most
> definitely need knobs to manage it that isn't just "enable/disable
> USER_NS" in the kernel config.
>
> So this whole "don't do this" approach you have is not acceptable.
>
> 99% of all code does NOT WANT the user namespace thing, and it's been
> a big new attack surface for the kernel getting things subtly wrong.

Yes. I know, and I keep saying geez guys isn't this the problem?
And I get told nope.  That isn't it.


> I do not understand your "people need to be able to do this with no
> controls", when the alternative is to literally turn it off ENTIRELY.

We already have /proc/sys/user/max_user_namespaces.  It is a per userns
control so you can run it in as fine grain as you like.  A little
cumbersome perhaps but real.

> I'm not saying that an LSM is the only place to do it, but I don't
> think there have been any better suggestions either.

I don't know.  I tried to have the conversation and Paul shut it down.
Effectively he said that where two or more out of tree LSM policies want
something it makes no sense to discussion the actual reasons people want
to use the hook.

> Put another way: your "no limits are acceptable" is simply not
> realistic, and you haven't given any sane alternatives that I am aware
> of. No way to say "sure, let trusted system apps create their
> namespaces, but not random things".

That isn't my position at all, that isn't even the case in the current
code.

In there current code there are two mechanisms that can be used to limit
things to secure system apps.  There is
/proc/sys/user/max_user_namespaces, and the security_capable hook in the
LSM.

I can imagine that /proc/sys/user/max_user_namespaces could be a bit
awkward to use as things need to be shuffled around a bit to get
a user namespace in place that you can use to set your number additional
user namespaces to 0, for the untrusted apps.

I can imagine that security_capable being a little unintuitive to find
but it has a parameter telling you it wants a capability from a
non-default user namespace.

It would be the easiest thing in the world in security_capable to
ask is this a trusted app, if not the answer is no.



My big objections are: Paul Moore shutdown the entire discussion into
why this is needed and alternatives, and that the mechanism the hook
is using silently breaks userspace applications.

In particular chrome's sandbox is silently disabled.  So in practice I
see this change advocating for silently stripping security from
userspace applications.

There is a security trade-off between attack surface and securing
applications here that I could never get the conversation around too.

My sense is that Paul figures with the policy in userspace (AKA the code
that actually uses these hooks), that it is completely out of scope to
consider what functionality the hooks make available.

In short I completely failed to have any reasonable conversations about
this code or it's implications, and it breaks userspace.

Eric
Eric W. Biederman Oct. 5, 2022, 12:59 p.m. UTC | #6
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 4, 2022 at 1:55 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So this whole "don't do this" approach you have is not acceptable.
>
> Side note: if we have a security hook for "create random file", then
> the notion that creating a whole new namespace somehow must not have
> any security hooks because it's *so* special is just ridiculous.
>
> I also note that right now USER_NS is both "default n" and "if not
> sure, say 'n'" in the Kconfig files, even though obviously that ship
> has sailed long ago.

Definitely.  I did try and slow the up take as long as I could when
the code was maturing.

> So originally it might have been a reasonable expectation to say "only
> enable this if you're doing containers in servers", but that clearly
> isn't the case any more. So we basically take USER_NS for granted, but
> the fact that people might want chrome to use it for sandboxing does
> *not* mean that randomly we want any CLONE_NEWNS to just be ok,
> regardless of how trusted (or not) the case is.

Most importantly chrome shows how the mechanism this patch is using to
deny user space applications is one that ``breaks'' userspace.  AKA
causing to silently fall back on less tested code paths that existed
before user namespaces were common.  Any application that does this
uses less trusted code paths.

I am pretty certain that the code should be sending a fatal signal
instead of returning an error code on user namespace creation.  I
unfortunately hadn't realized the implications until this conversation.



That said Chrome and sandboxing I think is a reasonable case to look at
to demonstrate the trade-offs.  This is not all about increased kernel
attack surface, user namespaces also enable reduced attack surface in
sandboxes, which can make applications safer.

Eric
Paul Moore Oct. 5, 2022, 1:38 p.m. UTC | #7
On Wed, Oct 5, 2022 at 8:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:

...

> > I'm not saying that an LSM is the only place to do it, but I don't
> > think there have been any better suggestions either.
>
> I don't know.  I tried to have the conversation and Paul shut it down.

I would encourage anyone reading the above statement to look at the
previous discussion, links were provided at the top of this thread in
the original pull request.

> Effectively he said that where two or more out of tree LSM policies want
> something it makes no sense to discussion the actual reasons people want
> to use the hook.

Runtime kernel configuration is inherently "out of tree", this
includes not only loadable LSM security policies (e.g. a SELinux
policy), the system's firewall configuration, things like sysctl.conf,
and countless others.  Please understand that "out of tree" in this
context is not the same as when it is used in the context of kernel
code; the former is actually a positive thing ("look we can configure
the kernel behavior the way we want!") while the latter is a
maintenance and support nightmare.
Eric W. Biederman Oct. 5, 2022, 3:32 p.m. UTC | #8
Paul Moore <paul@paul-moore.com> writes:

> On Wed, Oct 5, 2022 at 8:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> ...
>
>> > I'm not saying that an LSM is the only place to do it, but I don't
>> > think there have been any better suggestions either.
>>
>> I don't know.  I tried to have the conversation and Paul shut it down.
>
> I would encourage anyone reading the above statement to look at the
> previous discussion, links were provided at the top of this thread in
> the original pull request.

Or they can just look below at your defense of out of tree code.
Where you again refuse to consider the other point of view.

>> Effectively he said that where two or more out of tree LSM policies want
>> something it makes no sense to discussion the actual reasons people want
>> to use the hook.
>
> Runtime kernel configuration is inherently "out of tree", this
> includes not only loadable LSM security policies (e.g. a SELinux
> policy), the system's firewall configuration, things like sysctl.conf,
> and countless others.  Please understand that "out of tree" in this
> context is not the same as when it is used in the context of kernel
> code; the former is actually a positive thing ("look we can configure
> the kernel behavior the way we want!") while the latter is a
> maintenance and support nightmare.

Paul are you saying my experience with /proc/net pointing incorrectly at
/proc/self/net instead of /proc/thread-self/net is invalid?

Paul are you saying that my experience with LSM needed a hack for
buggy LSMs in __ptrace_may_access since Jun 2006 is invalid?

My experience with the conditionals and the policy (not just on/off
configuration) existing in userspace is very much the maintenance and
support nightmare you have been describing.

In this case it is compounded because the mechanism chosen to alert
userspace of a failure (the error code) is chosen by userspace.
Further this is a mechanism that does cause silent application failures.

Do you see how it could be a concern with silent application failures
and no control in the kernel to fix anything if/when this turns into a
real world problem?  Yes, I finally found the time and tested and
verified that is the case.


I am also wondering how you see it makes sense to add userspace
functionality without discussing it on linux-api, and not documenting
what is being added?  It is easy to overlook but it was suggested.

I am wondering how you feel about adding a mechanism to the kernel
that results in userspace breakage if used?

Eric
Paul Moore Oct. 5, 2022, 4:04 p.m. UTC | #9
On Wed, Oct 5, 2022 at 11:33 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Wed, Oct 5, 2022 at 8:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Linus Torvalds <torvalds@linux-foundation.org> writes:

...

> >> Effectively he said that where two or more out of tree LSM policies want
> >> something it makes no sense to discussion the actual reasons people want
> >> to use the hook.
> >
> > Runtime kernel configuration is inherently "out of tree", this
> > includes not only loadable LSM security policies (e.g. a SELinux
> > policy), the system's firewall configuration, things like sysctl.conf,
> > and countless others.  Please understand that "out of tree" in this
> > context is not the same as when it is used in the context of kernel
> > code; the former is actually a positive thing ("look we can configure
> > the kernel behavior the way we want!") while the latter is a
> > maintenance and support nightmare.
>
> Paul are you saying my experience with /proc/net pointing incorrectly at
> /proc/self/net instead of /proc/thread-self/net is invalid?

My comment was that runtime kernel configuration is always going to be
out of tree due to its very nature, and conflating runtime
configuration with kernel code is a mistake.
Linus Torvalds Oct. 5, 2022, 6:54 p.m. UTC | #10
On Wed, Oct 5, 2022 at 5:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> We already have /proc/sys/user/max_user_namespaces.  It is a per userns
> control so you can run it in as fine grain as you like.  A little
> cumbersome perhaps but real.

It's not that it's cumbersome.

It's that it is *USELESS*.

Sure, it limits the memory footprint of somebody who does the
fork-bomb equivalent of user namespaces, but that's the least of the
problems.

Just think of normal users. They'd want a limited number of user
namespaces for things like sandboxing (whether google chrome or
whatever).

So distros do want to allow people a few of them.

But they want to be able to do so in a *controlled* manner. Not a "ok,
this user can create five user namespaces and do whatever they want in
them". Because we've had the issues where some kernel part has gotten
things wrong, and thought "local NS root means root" or similar.

So it's not about the number of namespaces. AT ALL. It's about *who*
and *what* does them.

> I don't know.  I tried to have the conversation and Paul shut it down.

I really get the feeling that the problem here is that you're not even
acknowledging the whole issue to begin with, since you mention that
"max_user_namespaces" not once, but twice in the email.

> It would be the easiest thing in the world in security_capable to
> ask is this a trusted app, if not the answer is no.

Isn't this *literally* what security_create_user_ns() would basically be doing?

IOW, letting the LSM just say "this app is trusted to create a new
user namespace".

And that is what the LSM model is literally designed for. Because the
kernel doesn't inherently know "I trust this app". It doesn't know the
difference between "google-chrome" and "l33t-crack3r". It needs some
kind of external set of rules.

See?

               Linus
Eric W. Biederman Oct. 5, 2022, 9:27 p.m. UTC | #11
Paul Moore <paul@paul-moore.com> writes:

> On Wed, Oct 5, 2022 at 11:33 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Paul Moore <paul@paul-moore.com> writes:
>> > On Wed, Oct 5, 2022 at 8:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> ...
>
>> >> Effectively he said that where two or more out of tree LSM policies want
>> >> something it makes no sense to discussion the actual reasons people want
>> >> to use the hook.
>> >
>> > Runtime kernel configuration is inherently "out of tree", this
>> > includes not only loadable LSM security policies (e.g. a SELinux
>> > policy), the system's firewall configuration, things like sysctl.conf,
>> > and countless others.  Please understand that "out of tree" in this
>> > context is not the same as when it is used in the context of kernel
>> > code; the former is actually a positive thing ("look we can configure
>> > the kernel behavior the way we want!") while the latter is a
>> > maintenance and support nightmare.
>>
>> Paul are you saying my experience with /proc/net pointing incorrectly at
>> /proc/self/net instead of /proc/thread-self/net is invalid?
>
> My comment was that runtime kernel configuration is always going to be
> out of tree due to its very nature, and conflating runtime
> configuration with kernel code is a mistake.

When the runtime configuration has it's own llvm backend I have
trouble seeing the difference.  It is code that controls kernel behavior
that does not live in the kernel.

Apparmor and selinux polices are not quite that generic but as I have
discovered much to my pain routine clean-ups that userspace does not
care about are blocked because apparmor and selinux policies have bugs.

When I can not fix kernel bugs because of policy bugs, it has the
same kind of effect as keeping an interface the same for out of
tree code.  A little worse actually.

Most kernel runtime configuration is a on/off of a numeric setting
with logic that is enabled by that in the kernel.  Those I can agree
are not the same.  

Given that the logic and it's bugs are going to be out of tree I do not
agree that we should only consider what goes into the kernel when
looking into that kind of code.  Instead we should treat it will all of
the due diligence that we attempt to use when creating a system call.
That very much has not happened here.

Eric
Paul Moore Oct. 5, 2022, 10:39 p.m. UTC | #12
On Wed, Oct 5, 2022 at 5:28 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Wed, Oct 5, 2022 at 11:33 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Paul Moore <paul@paul-moore.com> writes:
> >> > On Wed, Oct 5, 2022 at 8:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >> Linus Torvalds <torvalds@linux-foundation.org> writes:
> >
> > ...
> >
> >> >> Effectively he said that where two or more out of tree LSM policies want
> >> >> something it makes no sense to discussion the actual reasons people want
> >> >> to use the hook.
> >> >
> >> > Runtime kernel configuration is inherently "out of tree", this
> >> > includes not only loadable LSM security policies (e.g. a SELinux
> >> > policy), the system's firewall configuration, things like sysctl.conf,
> >> > and countless others.  Please understand that "out of tree" in this
> >> > context is not the same as when it is used in the context of kernel
> >> > code; the former is actually a positive thing ("look we can configure
> >> > the kernel behavior the way we want!") while the latter is a
> >> > maintenance and support nightmare.
> >>
> >> Paul are you saying my experience with /proc/net pointing incorrectly at
> >> /proc/self/net instead of /proc/thread-self/net is invalid?
> >
> > My comment was that runtime kernel configuration is always going to be
> > out of tree due to its very nature, and conflating runtime
> > configuration with kernel code is a mistake.

...

> Given that the logic and it's bugs are going to be out of tree I do not
> agree that we should only consider what goes into the kernel when
> looking into that kind of code.  Instead we should treat it will all of
> the due diligence that we attempt to use when creating a system call.
> That very much has not happened here.

Eric, I disagree with most of what you said, to the point where we
could probably go round and round in circles for days on this and not
be any closer to an agreeable conclusion.  I don't know about you, but
that is not my idea of time well spent, especially since Linus has
already voiced his opinion on the matter.  I will end my comments here
with the hope that someday soon you can at least find the ability to
respect the consensus decision, even if you can't bring yourself to
agree with it.
Vegard Nossum Oct. 6, 2022, 8:29 a.m. UTC | #13
On 10/4/22 22:55, Linus Torvalds wrote:
> And I think you are in denial about how many problems the
> user-namespace stuff has caused.
> 
> Distros are literally turning it off entirely because the whole "let
> users create their own namespace" has *NOT* been a great success.

[...]

> I'm not saying that an LSM is the only place to do it, but I don't
> think there have been any better suggestions either.

I had posted a patch here for restricting module loading requests from
user namespaces as a way to reduce attack surface:

v2: 
https://lore.kernel.org/all/20220815082753.6088-1-vegard.nossum@oracle.com/

v1: 
https://lore.kernel.org/all/20220809185229.28417-1-vegard.nossum@oracle.com/

I'll respond to Serge's comment on the v2 thread.


Vegard
Eric W. Biederman Oct. 6, 2022, 12:52 p.m. UTC | #14
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Oct 5, 2022 at 5:39 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> We already have /proc/sys/user/max_user_namespaces.  It is a per userns
>> control so you can run it in as fine grain as you like.  A little
>> cumbersome perhaps but real.
>
> It's not that it's cumbersome.
>
> It's that it is *USELESS*.

Not quite.  Because you can have lower settings inside created user
namespaces, it is easy to create a user namespace so inside of
a sandbox you can't create more user namespaces.

> Sure, it limits the memory footprint of somebody who does the
> fork-bomb equivalent of user namespaces, but that's the least of the
> problems.
>
> Just think of normal users. They'd want a limited number of user
> namespaces for things like sandboxing (whether google chrome or
> whatever).
>
> So distros do want to allow people a few of them.
>
> But they want to be able to do so in a *controlled* manner. Not a "ok,
> this user can create five user namespaces and do whatever they want in
> them". Because we've had the issues where some kernel part has gotten
> things wrong, and thought "local NS root means root" or similar.
>
> So it's not about the number of namespaces. AT ALL. It's about *who*
> and *what* does them.

I see your concern that I am not acknowledging the current state of play
of user namespaces, and what problems they may be causing people.  Let
me describe what I understand the current state to be and you can
correct me where I am wrong.

At this point every old-school distro I am aware of enables user
namespaces by default. (I don't know what android and chrome-os do).  As
best I have been able to determine the bugs that are turning up are the
same kinds of bugs and at roughly the same rates as bugs turn up
anywhere else in the user/kernel interface.

There is a common trend that shows up when people examine posted
exploits (especially exploits connected to CVEs), a lot of the exploits
create user namespaces.  As best as I have been able to determine this
has two causes.  Arranging the conditions so that the bugs can be
triggered without using user namespaces is more difficult.  There is
more attack surface when user namespaces are enabled.

Even so it is my sense that for a typical linux desktop the additional
risk of running with user namespaces enabled is negligible.  My sense is
that there is a modest probability that there is a exploitable kernel
bug with user namespaces disabled, and that enabling them only slightly
increases the probability that there is an exploitable kernel bug.

The conclusion I draw from that is that it is ok to leave the use of
namespaces enabled by default for most people most of the time.

As best as I can determine it takes running on separate machines, or
possibly running inside of virtual machines to provide reasonable safety
from an attacker who manages to run arbitrary unprivileged userspace
code.  I don't believe disabling user namespaces appreciably moves the
needle.

In the linux desktop space there is chrome, and things like flatpak
that use sandboxes to run applications.  Making the gain of being able
to construct your own sandboxes something that has practical utility.
I believe that sandboxes make it more difficult for attackers to
achieve the ability to run arbitrary unprivileged userspace code.


So I think user namespaces are better default on.  Having user
namespaces default on means that people find them available and use
them.  Asking people to code for enviroments where user namespaces
may be on, may be off to the best I believe will encourage people
not to use the best tools that are available for securing their
applications, and so will be a net detriment to the community.

That is why max_user_namespaces looks like all it is good for is
preventing the user namespace equivalent of fork bombs, but is in fact
trivially capable of disabling user namespaces in sandboxes.

I do think there are cases where things get locked down and sandboxed
and tight that people should disable anything and everything they don't
need.  That in those case you want to implement the principle of least
privilege.  I see no problem with supporting those kinds of systems.

>> I don't know.  I tried to have the conversation and Paul shut it down.
>
> I really get the feeling that the problem here is that you're not even
> acknowledging the whole issue to begin with, since you mention that
> "max_user_namespaces" not once, but twice in the email.

Mentioning max_user_namespaces was my point that I very much see the
need for some controls.  That there are some already existing and
that is the one I built in the kernel already.  Not a contention
that these are good enough.

Saying Paul was the one who shut things down was probably going a bit
too far.  But I am very frustrated that the my attempts to figure out
what was going on.


>> It would be the easiest thing in the world in security_capable to
>> ask is this a trusted app, if not the answer is no.
>
> Isn't this *literally* what security_create_user_ns() would basically
> be doing?

The actual implementation of security_create_user_ns() current creates
regressions in userspace.

So I think the current implementation needs to be reverted, and the
current failure paths in create_user_ns with the same behavior need to
be fixed and we can talk about what are appropriate mechanisms to add
to the kernel.

I trust you can handle the revert of the regression.  At the moment I
have spent way more time on this then I have time for.  My family needs
me.

Eric
Theodore Ts'o Oct. 6, 2022, 2:42 p.m. UTC | #15
Eric, there's one thing I don't get about why you hate LSM's having
the ability to prevent user name spaces, and yet you are OK with
(indeed, touting) /proc/sys/user/max_user_namespaces.

If we set max_user_namespaces to N, won't the N+1'th application get
an error when they try to create a user namespace?  One of your
arguments about why having LSM's forcing a error of say, EPERM, is
that applications that don't check for error returns might get
confused.  (Those are buggy applications; so they should be fixed ---
isn't that your argument about why we shouldn't be freaking out over
security bugs caused by kernel bugs and user namespaces.)  But if you
set max_user_namespaces, that same application will now fail with say,
ENOSPC.

And if in fact there is a buggy application that will create a
security exposure because they aren't checking !@#@?! error returns,
an attacker can simply create N user namespaces --- and then trigger
the buggy applicaion, at which point, they will have 0wned the system.

    	  	      	       	      - Ted