diff mbox

[resend,2/2] userns: control capabilities of some user namespaces

Message ID 20171103004436.40026-1-mahesh@bandewar.net (mailing list archive)
State New, archived
Headers show

Commit Message

Mahesh Bandewar Nov. 3, 2017, 12:44 a.m. UTC
From: Mahesh Bandewar <maheshb@google.com>

With this new notion of "controlled" user-namespaces, the controlled
user-namespaces are marked at the time of their creation while the
capabilities of processes that belong to them are controlled using the
global mask.

Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
that belongs to uncontrolled user-ns can create another (child) user-
namespace that is uncontrolled. Any other process (that either does
not have SYS_ADMIN or belongs to a controlled user-ns) can only
create a user-ns that is controlled.

global-capability-whitelist (controlled_userns_caps_whitelist) is used
at the capability check-time and keeps the semantics for the processes
that belong to uncontrolled user-ns as it is. Processes that belong to
controlled user-ns however are subjected to different checks-

   (a) if the capability in question is controlled and process belongs
       to controlled user-ns, then it's always denied.
   (b) if the capability in question is NOT controlled then fall back
       to the traditional check.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 include/linux/capability.h     |  1 +
 include/linux/user_namespace.h | 20 ++++++++++++++++++++
 kernel/capability.c            |  5 +++++
 kernel/user_namespace.c        |  3 +++
 security/commoncap.c           |  8 ++++++++
 5 files changed, 37 insertions(+)

Comments

Serge E. Hallyn Nov. 4, 2017, 11:53 p.m. UTC | #1
Quoting Mahesh Bandewar (mahesh@bandewar.net):
> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> that belongs to uncontrolled user-ns can create another (child) user-
> namespace that is uncontrolled. Any other process (that either does
> not have SYS_ADMIN or belongs to a controlled user-ns) can only
> create a user-ns that is controlled.

That's a huge change though.  It means that any system that previously
used unprivileged containers will need new privileged code (which always
risks more privilege leaks through the new code) to re-enable what was
possible without privilege before.  That's a regression.

I'm very much interested in what you want to do,  But it seems like
it would be worth starting with some automated code analysis that shows
exactly what code becomes accessible to unprivileged users with user
namespaces which was accessible to unprivileged users before.  Then we
can reason about classifying that code and perhaps limiting access to
some of it.
On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>
> Quoting Mahesh Bandewar (mahesh@bandewar.net):
> > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> > that belongs to uncontrolled user-ns can create another (child) user-
> > namespace that is uncontrolled. Any other process (that either does
> > not have SYS_ADMIN or belongs to a controlled user-ns) can only
> > create a user-ns that is controlled.
>
> That's a huge change though.  It means that any system that previously
> used unprivileged containers will need new privileged code (which always
> risks more privilege leaks through the new code) to re-enable what was
> possible without privilege before.  That's a regression.
>
I wouldn't call it a regression since the existing behavior is
preserved as it is if the default-mask is not altered. i.e.
uncontrolled process can create user-ns and have full control inside
that user-ns. The only difference is - as an example if 'something'
comes up which makes a specific capability expose ring-0, so admin can
quickly remove the capability in question from the mask, so that no
untrusted code can exploit that capability until either the kernel is
patched or workloads are sanitized keeping in mind what was
discovered. (I have given some real life example vulnerabilities
published recently about CAP_NET_RAW in the cover letter)

> I'm very much interested in what you want to do,  But it seems like
> it would be worth starting with some automated code analysis that shows
> exactly what code becomes accessible to unprivileged users with user
> namespaces which was accessible to unprivileged users before.  Then we
> can reason about classifying that code and perhaps limiting access to
> some of it.
I would like to look at this as 'a tool' that is available to admins
who can quickly take possible-compromise-situation under-control
probably at the cost of some functionality-loss until kernel is
patched and the mask is restored to default value.

I'm not sure if automated tools could discover anything since these
changes should not alter behavior in any way.
Serge E. Hallyn Nov. 6, 2017, 3:03 p.m. UTC | #3
Quoting Mahesh Bandewar (महेश बंडेवार) (maheshb@google.com):
> On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> >
> > Quoting Mahesh Bandewar (mahesh@bandewar.net):
> > > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> > > that belongs to uncontrolled user-ns can create another (child) user-
> > > namespace that is uncontrolled. Any other process (that either does
> > > not have SYS_ADMIN or belongs to a controlled user-ns) can only
> > > create a user-ns that is controlled.
> >
> > That's a huge change though.  It means that any system that previously
> > used unprivileged containers will need new privileged code (which always
> > risks more privilege leaks through the new code) to re-enable what was
> > possible without privilege before.  That's a regression.
> >
> I wouldn't call it a regression since the existing behavior is
> preserved as it is if the default-mask is not altered. i.e.
> uncontrolled process can create user-ns and have full control inside
> that user-ns. The only difference is - as an example if 'something'
> comes up which makes a specific capability expose ring-0, so admin can
> quickly remove the capability in question from the mask, so that no
> untrusted code can exploit that capability until either the kernel is

Oh, sorry, I misread then, and missed that step.  I thought the default
with this patchset was that there were no capabilities exposed to user
namespaces.

> patched or workloads are sanitized keeping in mind what was
> discovered. (I have given some real life example vulnerabilities
> published recently about CAP_NET_RAW in the cover letter)
> 
> > I'm very much interested in what you want to do,  But it seems like
> > it would be worth starting with some automated code analysis that shows
> > exactly what code becomes accessible to unprivileged users with user
> > namespaces which was accessible to unprivileged users before.  Then we
> > can reason about classifying that code and perhaps limiting access to
> > some of it.
> I would like to look at this as 'a tool' that is available to admins
> who can quickly take possible-compromise-situation under-control
> probably at the cost of some functionality-loss until kernel is
> patched and the mask is restored to default value.

The thing that makes me hesitate with this set is that it is a
permanent new feature to address what (I hope) is a temporary
problem.  What would you think about doing this as a stackable
(yama-style) LSM?

> I'm not sure if automated tools could discover anything since these
> changes should not alter behavior in any way.

Seems like there are two naive ways to do it, the first being to just
look at all code under ns_capable() plus code called from there.  It
seems like looking at the result of that could be fruitful.

-serge
Daniel Micay Nov. 6, 2017, 9:33 p.m. UTC | #4
Substantial added attack surface will never go away as a problem. There
aren't a finite number of vulnerabilities to be found.
Serge E. Hallyn Nov. 6, 2017, 10:14 p.m. UTC | #5
Quoting Daniel Micay (danielmicay@gmail.com):
> Substantial added attack surface will never go away as a problem. There
> aren't a finite number of vulnerabilities to be found.

There's varying levels of usefulness and quality.  There is code which I
want to be able to use in a container, and code which I can't ever see a
reason for using there.  The latter, especially if it's also in a
staging driver, would be nice to have a toggle to disable.

You're not advocating dropping the added attack surface, only adding a
way of dealing with an 0day after the fact.  Privilege raising 0days can
exist anywhere, not just in code which only root in a user namespace can
exercise.  So from that point of view, ksplice seems a more complete
solution.  Why not just actually fix the bad code block when we know
about it?

Finally, it has been well argued that you can gain many new caps from
having only a few others.  Given that, how could you ever be sure that,
if an 0day is found which allows root in a user ns to abuse
CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
would suffice?  It seems to me that the existing control in
/proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
in that case.

-serge
Christian Brauner Nov. 6, 2017, 10:42 p.m. UTC | #6
On Mon, Nov 06, 2017 at 04:14:18PM -0600, Serge Hallyn wrote:
> Quoting Daniel Micay (danielmicay@gmail.com):
> > Substantial added attack surface will never go away as a problem. There
> > aren't a finite number of vulnerabilities to be found.
> 
> There's varying levels of usefulness and quality.  There is code which I
> want to be able to use in a container, and code which I can't ever see a
> reason for using there.  The latter, especially if it's also in a
> staging driver, would be nice to have a toggle to disable.
> 
> You're not advocating dropping the added attack surface, only adding a
> way of dealing with an 0day after the fact.  Privilege raising 0days can
> exist anywhere, not just in code which only root in a user namespace can
> exercise.  So from that point of view, ksplice seems a more complete
> solution.  Why not just actually fix the bad code block when we know
> about it?
> 
> Finally, it has been well argued that you can gain many new caps from
> having only a few others.  Given that, how could you ever be sure that,
> if an 0day is found which allows root in a user ns to abuse
> CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> would suffice?  It seems to me that the existing control in
> /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> in that case.

I agree that /proc/sys/kernel/unprivileged_userns_clone is the most reasonable
thing to do. This patch introduces a layer of complexity to fine-tune user
namespace creation that - in the relevant security critical scenario - should
simply be turned of entirely.

Is /proc/sys/kernel/unprivileged_userns_clone upstreamed or is this still only
carried downstream?
Boris Lukashev Nov. 6, 2017, 11:17 p.m. UTC | #7
On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Daniel Micay (danielmicay@gmail.com):
>> Substantial added attack surface will never go away as a problem. There
>> aren't a finite number of vulnerabilities to be found.
>
> There's varying levels of usefulness and quality.  There is code which I
> want to be able to use in a container, and code which I can't ever see a
> reason for using there.  The latter, especially if it's also in a
> staging driver, would be nice to have a toggle to disable.
>
> You're not advocating dropping the added attack surface, only adding a
> way of dealing with an 0day after the fact.  Privilege raising 0days can
> exist anywhere, not just in code which only root in a user namespace can
> exercise.  So from that point of view, ksplice seems a more complete
> solution.  Why not just actually fix the bad code block when we know
> about it?
>
> Finally, it has been well argued that you can gain many new caps from
> having only a few others.  Given that, how could you ever be sure that,
> if an 0day is found which allows root in a user ns to abuse
> CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> would suffice?  It seems to me that the existing control in
> /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> in that case.
>
> -serge

This seems to be heading toward "we need full zones in Linux" with
their own procfs and sysfs namespace and a stricter isolation model
for resources and capabilities. So long as things can happen in a
namespace which have a privileged relationship with host resources,
this is going to be cat-and-mouse to one degree or another.

Containers and namespaces dont have a one-to-one relationship, so i'm
not sure that's the best term to use in the kernel security context
since there's a bunch of userspace and implementation delta across the
different systems (with their own security models and so forth).
Without accounting for what a specific implementation may or may not
do, and only looking at "how do we reduce privileged impact on parent
context from unprivileged namespaces," this patch does seem to provide
a logical way of reducing the privileges available in such a namespace
and often needed to mount escapes/impact parent context.

-Boris
Serge E. Hallyn Nov. 6, 2017, 11:39 p.m. UTC | #8
Quoting Boris Lukashev (blukashev@sempervictus.com):
> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Daniel Micay (danielmicay@gmail.com):
> >> Substantial added attack surface will never go away as a problem. There
> >> aren't a finite number of vulnerabilities to be found.
> >
> > There's varying levels of usefulness and quality.  There is code which I
> > want to be able to use in a container, and code which I can't ever see a
> > reason for using there.  The latter, especially if it's also in a
> > staging driver, would be nice to have a toggle to disable.
> >
> > You're not advocating dropping the added attack surface, only adding a
> > way of dealing with an 0day after the fact.  Privilege raising 0days can
> > exist anywhere, not just in code which only root in a user namespace can
> > exercise.  So from that point of view, ksplice seems a more complete
> > solution.  Why not just actually fix the bad code block when we know
> > about it?
> >
> > Finally, it has been well argued that you can gain many new caps from
> > having only a few others.  Given that, how could you ever be sure that,
> > if an 0day is found which allows root in a user ns to abuse
> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> > would suffice?  It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> > in that case.
> >
> > -serge
> 
> This seems to be heading toward "we need full zones in Linux" with
> their own procfs and sysfs namespace and a stricter isolation model
> for resources and capabilities. So long as things can happen in a
> namespace which have a privileged relationship with host resources,
> this is going to be cat-and-mouse to one degree or another.
> 
> Containers and namespaces dont have a one-to-one relationship, so i'm
> not sure that's the best term to use in the kernel security context

Sorry - what's not the best term to use?

> since there's a bunch of userspace and implementation delta across the
> different systems (with their own security models and so forth).
> Without accounting for what a specific implementation may or may not
> do, and only looking at "how do we reduce privileged impact on parent
> context from unprivileged namespaces," this patch does seem to provide
> a logical way of reducing the privileges available in such a namespace
> and often needed to mount escapes/impact parent context.

What different implementations do is irrelevant - as an unprivileged user
I can always, with no help, create a new user namespace mapping my current
uid to root, and exercise this code.  So the security model implemented
by a particular userspace namespace-using driver doesn't matter, as it
only restricts me if I choose to use it.

But, I guess you're actually saying that some program might know that it
should never use network code so want to drop CAP_NET_*?  And you're
saying that a "global capability bounding set" might be useful?

Would it be better to actually implement it as a new bounding set that
is maintained across user namespace creations, but is per-task (inherted
by children of course)?  Instead of a sysctl?

-serge
Boris Lukashev Nov. 7, 2017, 12:01 a.m. UTC | #9
On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Boris Lukashev (blukashev@sempervictus.com):
>> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> > Quoting Daniel Micay (danielmicay@gmail.com):
>> >> Substantial added attack surface will never go away as a problem. There
>> >> aren't a finite number of vulnerabilities to be found.
>> >
>> > There's varying levels of usefulness and quality.  There is code which I
>> > want to be able to use in a container, and code which I can't ever see a
>> > reason for using there.  The latter, especially if it's also in a
>> > staging driver, would be nice to have a toggle to disable.
>> >
>> > You're not advocating dropping the added attack surface, only adding a
>> > way of dealing with an 0day after the fact.  Privilege raising 0days can
>> > exist anywhere, not just in code which only root in a user namespace can
>> > exercise.  So from that point of view, ksplice seems a more complete
>> > solution.  Why not just actually fix the bad code block when we know
>> > about it?
>> >
>> > Finally, it has been well argued that you can gain many new caps from
>> > having only a few others.  Given that, how could you ever be sure that,
>> > if an 0day is found which allows root in a user ns to abuse
>> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
>> > would suffice?  It seems to me that the existing control in
>> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>> > in that case.
>> >
>> > -serge
>>
>> This seems to be heading toward "we need full zones in Linux" with
>> their own procfs and sysfs namespace and a stricter isolation model
>> for resources and capabilities. So long as things can happen in a
>> namespace which have a privileged relationship with host resources,
>> this is going to be cat-and-mouse to one degree or another.
>>
>> Containers and namespaces dont have a one-to-one relationship, so i'm
>> not sure that's the best term to use in the kernel security context
>
> Sorry - what's not the best term to use?

Pardon, "containers," since they're namespaces+system construct.

>
>> since there's a bunch of userspace and implementation delta across the
>> different systems (with their own security models and so forth).
>> Without accounting for what a specific implementation may or may not
>> do, and only looking at "how do we reduce privileged impact on parent
>> context from unprivileged namespaces," this patch does seem to provide
>> a logical way of reducing the privileges available in such a namespace
>> and often needed to mount escapes/impact parent context.
>
> What different implementations do is irrelevant - as an unprivileged user
> I can always, with no help, create a new user namespace mapping my current
> uid to root, and exercise this code.  So the security model implemented
> by a particular userspace namespace-using driver doesn't matter, as it
> only restricts me if I choose to use it.
>
> But, I guess you're actually saying that some program might know that it
> should never use network code so want to drop CAP_NET_*?  And you're
> saying that a "global capability bounding set" might be useful?
>

The "global capability bounding set" with forced inheritance can be
used to prevent the vector you describe wherein the capability of UID
0 in the child NS is restricted from the parent implicitly, so yes,
that nomenclature seems appropriate.

> Would it be better to actually implement it as a new bounding set that
> is maintained across user namespace creations, but is per-task (inherted
> by children of course)?  Instead of a sysctl?
>
> -serge

In line with the previous comment, the inheritance across subsequent
invocations should be forced to prevent the context you described.
Please pardon my ignorance, not sure what you mean in terms of
"per-task" across namespace creation.

-Boris
Daniel Micay Nov. 7, 2017, 2:16 a.m. UTC | #10
On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote:
> Quoting Daniel Micay (danielmicay@gmail.com):
> > Substantial added attack surface will never go away as a problem.
> > There
> > aren't a finite number of vulnerabilities to be found.
> 
> There's varying levels of usefulness and quality.  There is code which
> I
> want to be able to use in a container, and code which I can't ever see
> a
> reason for using there.  The latter, especially if it's also in a
> staging driver, would be nice to have a toggle to disable.
> 
> You're not advocating dropping the added attack surface, only adding a
> way of dealing with an 0day after the fact.  Privilege raising 0days
> can
> exist anywhere, not just in code which only root in a user namespace
> can
> exercise.  So from that point of view, ksplice seems a more complete
> solution.  Why not just actually fix the bad code block when we know
> about it?

That's not what I'm advocating. I only care about it for proactive
attack surface reduction downstream. I have no interest in using it to
block access to known vulnerabilities.

> Finally, it has been well argued that you can gain many new caps from
> having only a few others.  Given that, how could you ever be sure
> that,
> if an 0day is found which allows root in a user ns to abuse
> CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> would suffice?

I didn't suggest using it that way...

>  It seems to me that the existing control in
> /proc/sys/kernel/unprivileged_userns_clone might be the better duct
> tape
> in that case.

There's no such thing as unprivileged_userns_clone in mainline.

The advantage of this over unprivileged_userns_clone in Debian and maybe
some other distributions is not giving up unprivileged app containers /
sandboxes implemented via user namespaces.  For example, Chromium's user
namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium
will be dropping their setuid sandbox, forcing usage of user namespaces
to avoid losing the sandbox which will greatly increase local kernel
attack surface on the host by exposing netfilter management, etc. to
unprivileged users.

The proposed approach isn't necessarily the best way to implement this
kind of mitigation but I think it's filling a real need.
Serge E. Hallyn Nov. 7, 2017, 3:23 a.m. UTC | #11
On Mon, Nov 06, 2017 at 09:16:03PM -0500, Daniel Micay wrote:
> On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote:
> > Quoting Daniel Micay (danielmicay@gmail.com):
> > > Substantial added attack surface will never go away as a problem.
> > > There
> > > aren't a finite number of vulnerabilities to be found.
> > 
> > There's varying levels of usefulness and quality.  There is code which
> > I
> > want to be able to use in a container, and code which I can't ever see
> > a
> > reason for using there.  The latter, especially if it's also in a
> > staging driver, would be nice to have a toggle to disable.
> > 
> > You're not advocating dropping the added attack surface, only adding a
> > way of dealing with an 0day after the fact.  Privilege raising 0days
> > can
> > exist anywhere, not just in code which only root in a user namespace
> > can
> > exercise.  So from that point of view, ksplice seems a more complete
> > solution.  Why not just actually fix the bad code block when we know
> > about it?
> 
> That's not what I'm advocating. I only care about it for proactive
> attack surface reduction downstream. I have no interest in using it to
> block access to known vulnerabilities.
> 
> > Finally, it has been well argued that you can gain many new caps from
> > having only a few others.  Given that, how could you ever be sure
> > that,
> > if an 0day is found which allows root in a user ns to abuse
> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> > would suffice?
> 
> I didn't suggest using it that way...
> 
> >  It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct
> > tape
> > in that case.
> 
> There's no such thing as unprivileged_userns_clone in mainline.

Hm.  I was sure Kees had gotten that in...  I guess I was wrong.

> The advantage of this over unprivileged_userns_clone in Debian and maybe
> some other distributions is not giving up unprivileged app containers /
> sandboxes implemented via user namespaces.  For example, Chromium's user
> namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium
> will be dropping their setuid sandbox, forcing usage of user namespaces
> to avoid losing the sandbox which will greatly increase local kernel
> attack surface on the host by exposing netfilter management, etc. to
> unprivileged users.
> 
> The proposed approach isn't necessarily the best way to implement this
> kind of mitigation but I think it's filling a real need.

I think I definately prefer what I mentioned in the email to Boris.
Basically a "permanent capability bounding set".  The normal bounding
set gets reset to a full set on every new user_ns creation.  In this
proposal, it would instead be set to the calling task's permanent
capability set, which starts (at boot) full, and which privileged
tasks can pull capabilities out of.

-serge
Serge E. Hallyn Nov. 7, 2017, 3:28 a.m. UTC | #12
On Mon, Nov 06, 2017 at 07:01:58PM -0500, Boris Lukashev wrote:
> On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Boris Lukashev (blukashev@sempervictus.com):
> >> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> >> > Quoting Daniel Micay (danielmicay@gmail.com):
> >> >> Substantial added attack surface will never go away as a problem. There
> >> >> aren't a finite number of vulnerabilities to be found.
> >> >
> >> > There's varying levels of usefulness and quality.  There is code which I
> >> > want to be able to use in a container, and code which I can't ever see a
> >> > reason for using there.  The latter, especially if it's also in a
> >> > staging driver, would be nice to have a toggle to disable.
> >> >
> >> > You're not advocating dropping the added attack surface, only adding a
> >> > way of dealing with an 0day after the fact.  Privilege raising 0days can
> >> > exist anywhere, not just in code which only root in a user namespace can
> >> > exercise.  So from that point of view, ksplice seems a more complete
> >> > solution.  Why not just actually fix the bad code block when we know
> >> > about it?
> >> >
> >> > Finally, it has been well argued that you can gain many new caps from
> >> > having only a few others.  Given that, how could you ever be sure that,
> >> > if an 0day is found which allows root in a user ns to abuse
> >> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> >> > would suffice?  It seems to me that the existing control in
> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> >> > in that case.
> >> >
> >> > -serge
> >>
> >> This seems to be heading toward "we need full zones in Linux" with
> >> their own procfs and sysfs namespace and a stricter isolation model
> >> for resources and capabilities. So long as things can happen in a
> >> namespace which have a privileged relationship with host resources,
> >> this is going to be cat-and-mouse to one degree or another.
> >>
> >> Containers and namespaces dont have a one-to-one relationship, so i'm
> >> not sure that's the best term to use in the kernel security context
> >
> > Sorry - what's not the best term to use?
> 
> Pardon, "containers," since they're namespaces+system construct.
> 
> >
> >> since there's a bunch of userspace and implementation delta across the
> >> different systems (with their own security models and so forth).
> >> Without accounting for what a specific implementation may or may not
> >> do, and only looking at "how do we reduce privileged impact on parent
> >> context from unprivileged namespaces," this patch does seem to provide
> >> a logical way of reducing the privileges available in such a namespace
> >> and often needed to mount escapes/impact parent context.
> >
> > What different implementations do is irrelevant - as an unprivileged user
> > I can always, with no help, create a new user namespace mapping my current
> > uid to root, and exercise this code.  So the security model implemented
> > by a particular userspace namespace-using driver doesn't matter, as it
> > only restricts me if I choose to use it.
> >
> > But, I guess you're actually saying that some program might know that it
> > should never use network code so want to drop CAP_NET_*?  And you're
> > saying that a "global capability bounding set" might be useful?
> >
> 
> The "global capability bounding set" with forced inheritance can be
> used to prevent the vector you describe wherein the capability of UID
> 0 in the child NS is restricted from the parent implicitly, so yes,
> that nomenclature seems appropriate.
> 
> > Would it be better to actually implement it as a new bounding set that
> > is maintained across user namespace creations, but is per-task (inherted
> > by children of course)?  Instead of a sysctl?
> >
> > -serge
> 
> In line with the previous comment, the inheritance across subsequent
> invocations should be forced to prevent the context you described.
> Please pardon my ignorance, not sure what you mean in terms of
> "per-task" across namespace creation.

I meant each task has a perm_cap_bset next to the cap_bset.  So task
p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset,
p2 (if it has privilege) can drop CAP_NET_ADMIN.  When p1 creates a
new user_ns, that init task has its cap_bset set to all caps but
CAP_SYS_ADMIN.

I think for simplicity perm_cap_bset would *only* affect the filling
of cap_bset at user namespace creation.  So if you wanted to drop a
capability from your own cap_bset as well, you'd have to do that
separately.
Sorry folks I was traveling and seems like lot happened on this thread. :p

I will try to response few of these comments selectively -

> The thing that makes me hesitate with this set is that it is a
> permanent new feature to address what (I hope) is a temporary
> problem.
I agree this is permanent new feature but it's not solving a temporary
problem. It's impossible to assess what and when new vulnerability
that could show up. I think Daniel summed it up appropriately in his
response

> Seems like there are two naive ways to do it, the first being to just
> look at all code under ns_capable() plus code called from there.  It
> seems like looking at the result of that could be fruitful.
This is really hard. The main issue that there were features designed
and developed before user-ns days with an assumption that unprivileged
users will never get certain capabilities which only root user gets.
Now that is not true anymore with user-ns creation with mapping root
for any process. Also at the same time blocking user-ns creation for
eveyone is a big-hammer which is not needed too. So it's not that easy
to just perform a code-walk-though and correct those decisions now.

> It seems to me that the existing control in
> /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> in that case.
This solution is essentially blocking unprivileged users from using
the user-namespaces entirely. This is not really a solution that can
work. The solution that this patch-set adds allows unprivileged users
to create user-namespaces. Actually the proposed solution is more
fine-grained approach than the unprivileged_userns_clone solution
since you can selectively block capabilities rather than completely
blocking the functionality.

> I meant each task has a perm_cap_bset next to the cap_bset.  So task
> p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset,
> p2 (if it has privilege) can drop CAP_NET_ADMIN.  When p1 creates a
> new user_ns, that init task has its cap_bset set to all caps but
> CAP_SYS_ADMIN.
>
> I think for simplicity perm_cap_bset would *only* affect the filling
> of cap_bset at user namespace creation.  So if you wanted to drop a
> capability from your own cap_bset as well, you'd have to do that
> separately.
My original intention is to reduce the attack surface when
vulnerabilities are discovered / published, but I don't see how this
is solving that issue. Also the reason to have sysctl is to have
simplistic control across the board to contain the situation. If that
is not addressed then we might need some other solution on top of
this.
Christian Brauner Nov. 8, 2017, 7:02 p.m. UTC | #14
On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> Sorry folks I was traveling and seems like lot happened on this thread. :p
> 
> I will try to response few of these comments selectively -
> 
> > The thing that makes me hesitate with this set is that it is a
> > permanent new feature to address what (I hope) is a temporary
> > problem.
> I agree this is permanent new feature but it's not solving a temporary
> problem. It's impossible to assess what and when new vulnerability
> that could show up. I think Daniel summed it up appropriately in his
> response
> 
> > Seems like there are two naive ways to do it, the first being to just
> > look at all code under ns_capable() plus code called from there.  It
> > seems like looking at the result of that could be fruitful.
> This is really hard. The main issue that there were features designed
> and developed before user-ns days with an assumption that unprivileged
> users will never get certain capabilities which only root user gets.
> Now that is not true anymore with user-ns creation with mapping root
> for any process. Also at the same time blocking user-ns creation for
> eveyone is a big-hammer which is not needed too. So it's not that easy
> to just perform a code-walk-though and correct those decisions now.
> 
> > It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> > in that case.
> This solution is essentially blocking unprivileged users from using
> the user-namespaces entirely. This is not really a solution that can
> work. The solution that this patch-set adds allows unprivileged users
> to create user-namespaces. Actually the proposed solution is more
> fine-grained approach than the unprivileged_userns_clone solution
> since you can selectively block capabilities rather than completely
> blocking the functionality.

I've been talking to Stéphane today about this and we should also keep in mind
that we have:

chb@conventiont|~
> ls -al /proc/sys/user/
total 0
dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces

These files allow you to limit the number of namespaces that can be created
*per namespace* type. So let's say your system runs a bunch of user namespaces
you can do:

chb@conventiont|~
> echo 0 > /proc/sys/user/max_user_namespaces

So that the next time you try to create a user namespaces you'd see:

chb@conventiont|~
> unshare -U
unshare: unshare failed: No space left on device

So there's not even a need to upstream a new sysctl since we have ways of
blocking this.

Also I'd like to point out that a lot of capability checks and actual security
vulnerabilities are associated with CAP_SYS_ADMIN. So what you likely want to do
is block CAP_SYS_ADMIN in user namespaces but at this point they become
basically useless for a lot of interesting use cases. In addition, this patch
would add another layer of complexity that is - imho - not really warranted
given what we already have. The relationship between capabilities and user
namespaces should stay as simply as possible so that it stays maintaineable.
User namespaces already introduce a proper layer of complexity.
Just my two cents. I might be totally off here of course.

Christian
On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
<christian.brauner@canonical.com> wrote:
> On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>> Sorry folks I was traveling and seems like lot happened on this thread. :p
>>
>> I will try to response few of these comments selectively -
>>
>> > The thing that makes me hesitate with this set is that it is a
>> > permanent new feature to address what (I hope) is a temporary
>> > problem.
>> I agree this is permanent new feature but it's not solving a temporary
>> problem. It's impossible to assess what and when new vulnerability
>> that could show up. I think Daniel summed it up appropriately in his
>> response
>>
>> > Seems like there are two naive ways to do it, the first being to just
>> > look at all code under ns_capable() plus code called from there.  It
>> > seems like looking at the result of that could be fruitful.
>> This is really hard. The main issue that there were features designed
>> and developed before user-ns days with an assumption that unprivileged
>> users will never get certain capabilities which only root user gets.
>> Now that is not true anymore with user-ns creation with mapping root
>> for any process. Also at the same time blocking user-ns creation for
>> eveyone is a big-hammer which is not needed too. So it's not that easy
>> to just perform a code-walk-though and correct those decisions now.
>>
>> > It seems to me that the existing control in
>> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>> > in that case.
>> This solution is essentially blocking unprivileged users from using
>> the user-namespaces entirely. This is not really a solution that can
>> work. The solution that this patch-set adds allows unprivileged users
>> to create user-namespaces. Actually the proposed solution is more
>> fine-grained approach than the unprivileged_userns_clone solution
>> since you can selectively block capabilities rather than completely
>> blocking the functionality.
>
> I've been talking to Stéphane today about this and we should also keep in mind
> that we have:
>
> chb@conventiont|~
>> ls -al /proc/sys/user/
> total 0
> dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>
> These files allow you to limit the number of namespaces that can be created
> *per namespace* type. So let's say your system runs a bunch of user namespaces
> you can do:
>
> chb@conventiont|~
>> echo 0 > /proc/sys/user/max_user_namespaces
>
> So that the next time you try to create a user namespaces you'd see:
>
> chb@conventiont|~
>> unshare -U
> unshare: unshare failed: No space left on device
>
> So there's not even a need to upstream a new sysctl since we have ways of
> blocking this.
>
I'm not sure how it's solving the problem that my patch-set is addressing?
I agree though that the need for unprivileged_userns_clone sysctl goes
away as this is equivalent to setting that sysctl to 0 as you have
described above.
However as I mentioned earlier, blocking processes from creating
user-namespaces is not the solution. Processes should be able to
create namespaces as they are designed but at the same time we need to
have controls to 'contain' them if a need arise. Setting max_no to 0
is not the solution that I'm looking for since it doesn't solve the
problem.

> Also I'd like to point out that a lot of capability checks and actual security
> vulnerabilities are associated with CAP_SYS_ADMIN. So what you likely want to do
> is block CAP_SYS_ADMIN in user namespaces but at this point they become
> basically useless for a lot of interesting use cases. In addition, this patch
> would add another layer of complexity that is - imho - not really warranted
> given what we already have.
I disagree. I'm not sure how this patch is adding complexity? Simply
the functionality is maintained exactly as it is with an extra knob
which allows you to take control back if a situation arise. Once the
kernel is patched for whatever was discovered, life returns back to
normal by readjusting the knob. It's as simple as that!

> The relationship between capabilities and user
> namespaces should stay as simply as possible so that it stays maintaineable.
> User namespaces already introduce a proper layer of complexity.
> Just my two cents. I might be totally off here of course.
The side effect of the solution is that you have sort-of scaled-down /
broken functionality without blocking the feature completely until
life returns to normal. If the workload needs the exact same
capability that is being controlled, then tough luck but chances of
you having a workload that is not in contention with the capability
that is being controlled is very high. In a situation like that you
wont even notice the change but at the same time admin can ensure that
the exploit is contained. This has a very high value.

>

> Christian
Serge E. Hallyn Nov. 9, 2017, 3:21 a.m. UTC | #16
On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
> <christian.brauner@canonical.com> wrote:
> > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> >> Sorry folks I was traveling and seems like lot happened on this thread. :p
> >>
> >> I will try to response few of these comments selectively -
> >>
> >> > The thing that makes me hesitate with this set is that it is a
> >> > permanent new feature to address what (I hope) is a temporary
> >> > problem.
> >> I agree this is permanent new feature but it's not solving a temporary
> >> problem. It's impossible to assess what and when new vulnerability
> >> that could show up. I think Daniel summed it up appropriately in his
> >> response
> >>
> >> > Seems like there are two naive ways to do it, the first being to just
> >> > look at all code under ns_capable() plus code called from there.  It
> >> > seems like looking at the result of that could be fruitful.
> >> This is really hard. The main issue that there were features designed
> >> and developed before user-ns days with an assumption that unprivileged
> >> users will never get certain capabilities which only root user gets.
> >> Now that is not true anymore with user-ns creation with mapping root
> >> for any process. Also at the same time blocking user-ns creation for
> >> eveyone is a big-hammer which is not needed too. So it's not that easy
> >> to just perform a code-walk-though and correct those decisions now.
> >>
> >> > It seems to me that the existing control in
> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> >> > in that case.
> >> This solution is essentially blocking unprivileged users from using
> >> the user-namespaces entirely. This is not really a solution that can
> >> work. The solution that this patch-set adds allows unprivileged users
> >> to create user-namespaces. Actually the proposed solution is more
> >> fine-grained approach than the unprivileged_userns_clone solution
> >> since you can selectively block capabilities rather than completely
> >> blocking the functionality.
> >
> > I've been talking to Stéphane today about this and we should also keep in mind
> > that we have:
> >
> > chb@conventiont|~
> >> ls -al /proc/sys/user/
> > total 0
> > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
> >
> > These files allow you to limit the number of namespaces that can be created
> > *per namespace* type. So let's say your system runs a bunch of user namespaces
> > you can do:
> >
> > chb@conventiont|~
> >> echo 0 > /proc/sys/user/max_user_namespaces
> >
> > So that the next time you try to create a user namespaces you'd see:
> >
> > chb@conventiont|~
> >> unshare -U
> > unshare: unshare failed: No space left on device
> >
> > So there's not even a need to upstream a new sysctl since we have ways of
> > blocking this.
> >
> I'm not sure how it's solving the problem that my patch-set is addressing?
> I agree though that the need for unprivileged_userns_clone sysctl goes
> away as this is equivalent to setting that sysctl to 0 as you have
> described above.

oh right that was the reasoning iirc for not needing the other sysctl.

> However as I mentioned earlier, blocking processes from creating
> user-namespaces is not the solution. Processes should be able to
> create namespaces as they are designed but at the same time we need to
> have controls to 'contain' them if a need arise. Setting max_no to 0
> is not the solution that I'm looking for since it doesn't solve the
> problem.

well yesterday we were told that was explicitly not the goal, but that was 
not by you ... i just mention it to explain why we seem to be walking in
circles a bit.

anyway the bounding set doesn't actually make sense so forget that.   the
question then is just whether it makes sense to allow things to continue
at all in this situation.  would you mind indulging me by giving one or two
concrete examples in the previous known cves of what capabilities you would
have dropped tto allow the rest to continue to be safely used?

thanks,
serge
On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार)
wrote:
>> On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
>> <christian.brauner@canonical.com> wrote:
>> > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश
बंडेवार) wrote:
>> >> Sorry folks I was traveling and seems like lot happened on this
thread. :p
>> >>
>> >> I will try to response few of these comments selectively -
>> >>
>> >> > The thing that makes me hesitate with this set is that it is a
>> >> > permanent new feature to address what (I hope) is a temporary
>> >> > problem.
>> >> I agree this is permanent new feature but it's not solving a temporary
>> >> problem. It's impossible to assess what and when new vulnerability
>> >> that could show up. I think Daniel summed it up appropriately in his
>> >> response
>> >>
>> >> > Seems like there are two naive ways to do it, the first being to
just
>> >> > look at all code under ns_capable() plus code called from there.  It
>> >> > seems like looking at the result of that could be fruitful.
>> >> This is really hard. The main issue that there were features designed
>> >> and developed before user-ns days with an assumption that unprivileged
>> >> users will never get certain capabilities which only root user gets.
>> >> Now that is not true anymore with user-ns creation with mapping root
>> >> for any process. Also at the same time blocking user-ns creation for
>> >> eveyone is a big-hammer which is not needed too. So it's not that easy
>> >> to just perform a code-walk-though and correct those decisions now.
>> >>
>> >> > It seems to me that the existing control in
>> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct
tape
>> >> > in that case.
>> >> This solution is essentially blocking unprivileged users from using
>> >> the user-namespaces entirely. This is not really a solution that can
>> >> work. The solution that this patch-set adds allows unprivileged users
>> >> to create user-namespaces. Actually the proposed solution is more
>> >> fine-grained approach than the unprivileged_userns_clone solution
>> >> since you can selectively block capabilities rather than completely
>> >> blocking the functionality.
>> >
>> > I've been talking to Stéphane today about this and we should also keep
in mind
>> > that we have:
>> >
>> > chb@conventiont|~
>> >> ls -al /proc/sys/user/
>> > total 0
>> > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
>> > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
>> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>> >
>> > These files allow you to limit the number of namespaces that can be
created
>> > *per namespace* type. So let's say your system runs a bunch of user
namespaces
>> > you can do:
>> >
>> > chb@conventiont|~
>> >> echo 0 > /proc/sys/user/max_user_namespaces
>> >
>> > So that the next time you try to create a user namespaces you'd see:
>> >
>> > chb@conventiont|~
>> >> unshare -U
>> > unshare: unshare failed: No space left on device
>> >
>> > So there's not even a need to upstream a new sysctl since we have ways
of
>> > blocking this.
>> >
>> I'm not sure how it's solving the problem that my patch-set is
addressing?
>> I agree though that the need for unprivileged_userns_clone sysctl goes
>> away as this is equivalent to setting that sysctl to 0 as you have
>> described above.
>
> oh right that was the reasoning iirc for not needing the other sysctl.
>
>> However as I mentioned earlier, blocking processes from creating
>> user-namespaces is not the solution. Processes should be able to
>> create namespaces as they are designed but at the same time we need to
>> have controls to 'contain' them if a need arise. Setting max_no to 0
>> is not the solution that I'm looking for since it doesn't solve the
>> problem.
>
> well yesterday we were told that was explicitly not the goal, but that was
> not by you ... i just mention it to explain why we seem to be walking in
> circles a bit.
>
> anyway the bounding set doesn't actually make sense so forget that.   the
> question then is just whether it makes sense to allow things to continue
> at all in this situation.  would you mind indulging me by giving one or
two
> concrete examples in the previous known cves of what capabilities you
would
> have dropped tto allow the rest to continue to be safely used?
>
Of course. Let's take an example of the CVE that I have mentioned in my
cover-letter - CVE-2017-7308
<https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308>. It's well
documented and even has a exploit c-program
<https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308> that
can demonstrate how it can be used against non-patched kernel. There is
very nice blog post
<https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html>
about this vulnerability by Andrey Konovalov.

This is about the AF_PACKET socket interface that is protected behind
NET_RAW capability. This capability is not available to unprivileged user.
However, any unprivileged user can get NET_RAW capability (as demonstrated
in the cover-letter code that I have attached in this patch series) so this
NET_RAW capability is available to any unprivileged user on the host if the
kernel has user-namespaces available.

With this patch-set applied, all that is needed is to flip a bit with the
sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated below -

root@lphh6:~# uname -a
Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
kernel.controlled_userns_caps_whitelist = 1f,ffffffff

Now when I run the program (demo from the cover-letter) as a normal
unprivileged user I can't create a RAW socket in init-ns but I can in the
child-ns.

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
Successfully opened RAW-Sock after unshare().
dumbo@lphh6:~$

Now as a root user. Take off CAP_NET_RAW

root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
root@lphh6:~#

Now run the same program as an unprivileged user -

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
socket() SOCK_RAW failed: : Operation not permitted
dumbo@lphh6:~$

Notice that it has failed to create a raw socket in init and in child
namespace. It's not blocking creation of user-namespaces but allowing admin
turn individual capability bits on and off.

This is very simplistic example of just demonstrating how capability bits
turn-on/off works. So let's assume a sandboxed environment where we don't
know what a binary that we are about run in an environment which is
identified as susceptible. By turning off the NET_RAW bit, the admin gets
an assurance that system is safe and if binary fails because it's not
getting this capability then that bad but a sad consequence (without
compromising the host integrity) but if it doesn't use the NET_RAW
capability but any other combination of remaining 36 capabilities, it would
get whatever is necessary. This means we can safely allow processes to
create user-namespaces by taking off certain capabilities in question for
temporary/extended period until proper fix is applied without compromising
the system integrity. The impact will vary based on which capability is
taken off and admin would / should be ware of for the environment that
he/she is dealing with.

thanks,
--mahesh..

> thanks,
> serge
[resend response as earlier one failed because of formatting issues]

On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
> > <christian.brauner@canonical.com> wrote:
> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> > >> Sorry folks I was traveling and seems like lot happened on this thread. :p
> > >>
> > >> I will try to response few of these comments selectively -
> > >>
> > >> > The thing that makes me hesitate with this set is that it is a
> > >> > permanent new feature to address what (I hope) is a temporary
> > >> > problem.
> > >> I agree this is permanent new feature but it's not solving a temporary
> > >> problem. It's impossible to assess what and when new vulnerability
> > >> that could show up. I think Daniel summed it up appropriately in his
> > >> response
> > >>
> > >> > Seems like there are two naive ways to do it, the first being to just
> > >> > look at all code under ns_capable() plus code called from there.  It
> > >> > seems like looking at the result of that could be fruitful.
> > >> This is really hard. The main issue that there were features designed
> > >> and developed before user-ns days with an assumption that unprivileged
> > >> users will never get certain capabilities which only root user gets.
> > >> Now that is not true anymore with user-ns creation with mapping root
> > >> for any process. Also at the same time blocking user-ns creation for
> > >> eveyone is a big-hammer which is not needed too. So it's not that easy
> > >> to just perform a code-walk-though and correct those decisions now.
> > >>
> > >> > It seems to me that the existing control in
> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> > >> > in that case.
> > >> This solution is essentially blocking unprivileged users from using
> > >> the user-namespaces entirely. This is not really a solution that can
> > >> work. The solution that this patch-set adds allows unprivileged users
> > >> to create user-namespaces. Actually the proposed solution is more
> > >> fine-grained approach than the unprivileged_userns_clone solution
> > >> since you can selectively block capabilities rather than completely
> > >> blocking the functionality.
> > >
> > > I've been talking to Stéphane today about this and we should also keep in mind
> > > that we have:
> > >
> > > chb@conventiont|~
> > >> ls -al /proc/sys/user/
> > > total 0
> > > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> > > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
> > >
> > > These files allow you to limit the number of namespaces that can be created
> > > *per namespace* type. So let's say your system runs a bunch of user namespaces
> > > you can do:
> > >
> > > chb@conventiont|~
> > >> echo 0 > /proc/sys/user/max_user_namespaces
> > >
> > > So that the next time you try to create a user namespaces you'd see:
> > >
> > > chb@conventiont|~
> > >> unshare -U
> > > unshare: unshare failed: No space left on device
> > >
> > > So there's not even a need to upstream a new sysctl since we have ways of
> > > blocking this.
> > >
> > I'm not sure how it's solving the problem that my patch-set is addressing?
> > I agree though that the need for unprivileged_userns_clone sysctl goes
> > away as this is equivalent to setting that sysctl to 0 as you have
> > described above.
>
> oh right that was the reasoning iirc for not needing the other sysctl.
>
> > However as I mentioned earlier, blocking processes from creating
> > user-namespaces is not the solution. Processes should be able to
> > create namespaces as they are designed but at the same time we need to
> > have controls to 'contain' them if a need arise. Setting max_no to 0
> > is not the solution that I'm looking for since it doesn't solve the
> > problem.
>
> well yesterday we were told that was explicitly not the goal, but that was
> not by you ... i just mention it to explain why we seem to be walking in
> circles a bit.
>
> anyway the bounding set doesn't actually make sense so forget that.   the
> question then is just whether it makes sense to allow things to continue
> at all in this situation.  would you mind indulging me by giving one or two
> concrete examples in the previous known cves of what capabilities you would
> have dropped tto allow the rest to continue to be safely used?
>
Of course. Let's take an example of the CVE that I have mentioned in
my cover-letter -
CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308).
It's well documented and even has a
exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
c-program that can demonstrate how it can be used against non-patched
kernel. There is very nice blog
post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html)
about this vulnerability by Andrey Konovalov.

This is about the AF_PACKET socket interface that is protected behind
NET_RAW capability. This capability is not available to unprivileged
user. However, any unprivileged user can get NET_RAW capability (as
demonstrated in the cover-letter code that I have attached in this
patch series) so this NET_RAW capability is available to any
unprivileged user on the host if the kernel has user-namespaces
available.

With this patch-set applied, all that is needed is to flip a bit with
the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated
below -

root@lphh6:~# uname -a
Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
kernel.controlled_userns_caps_whitelist = 1f,ffffffff

Now when I run the program (demo from the cover-letter) as a normal
unprivileged user I can't create a RAW socket in init-ns but I can in
the child-ns.

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
Successfully opened RAW-Sock after unshare().
dumbo@lphh6:~$

Now as a root user. Take off CAP_NET_RAW

root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
root@lphh6:~#

Now run the same program as an unprivileged user -

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
socket() SOCK_RAW failed: : Operation not permitted
dumbo@lphh6:~$

Notice that it has failed to create a raw socket in init and in child
namespace. It's not blocking creation of user-namespaces but allowing
admin turn individual capability bits on and off.

This is very simplistic example of just demonstrating how capability
bits turn-on/off works. So let's assume a sandboxed environment where
we don't know what a binary that we are about run in an environment
which is identified as susceptible. By turning off the NET_RAW bit,
the admin gets an assurance that system is safe and if binary fails
because it's not getting this capability then that bad but a sad
consequence (without compromising the host integrity) but if it
doesn't use the NET_RAW capability but any other combination of
remaining 36 capabilities, it would get whatever is necessary. This
means we can safely allow processes to create user-namespaces by
taking off certain capabilities in question for temporary/extended
period until proper fix is applied without compromising the system
integrity. The impact will vary based on which capability is taken off
and admin would / should be ware of for the environment that he/she is
dealing with.

thanks,
--mahesh..

> thanks,
> serge
Serge E. Hallyn Nov. 9, 2017, 4:14 p.m. UTC | #19
Quoting Mahesh Bandewar (महेश बंडेवार) (maheshb@google.com):
> Of course. Let's take an example of the CVE that I have mentioned in
> my cover-letter -
> CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308).
> It's well documented and even has a
> exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
> c-program that can demonstrate how it can be used against non-patched
> kernel. There is very nice blog
> post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html)
> about this vulnerability by Andrey Konovalov.

Ok, thanks.  It's a good example because the fix for this CVE actually
came by itself (http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/changelog).
Normally multiple CVEs come at the same time, which would make a
workaround for one now helpful.  This is a good counter-example.

I'm going to maintain that I really don't like this.  But it looks
useful, so ack on the concept, I'll just have to look again at the
code now.  Thanks for indulging me.

-serge
Serge E. Hallyn Nov. 9, 2017, 5:25 p.m. UTC | #20
Quoting Mahesh Bandewar (mahesh@bandewar.net):
> From: Mahesh Bandewar <maheshb@google.com>
> 
> With this new notion of "controlled" user-namespaces, the controlled
> user-namespaces are marked at the time of their creation while the
> capabilities of processes that belong to them are controlled using the
> global mask.
> 
> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> that belongs to uncontrolled user-ns can create another (child) user-
> namespace that is uncontrolled. Any other process (that either does
> not have SYS_ADMIN or belongs to a controlled user-ns) can only
> create a user-ns that is controlled.
> 
> global-capability-whitelist (controlled_userns_caps_whitelist) is used
> at the capability check-time and keeps the semantics for the processes
> that belong to uncontrolled user-ns as it is. Processes that belong to
> controlled user-ns however are subjected to different checks-
> 
>    (a) if the capability in question is controlled and process belongs
>        to controlled user-ns, then it's always denied.
>    (b) if the capability in question is NOT controlled then fall back
>        to the traditional check.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
>  include/linux/capability.h     |  1 +
>  include/linux/user_namespace.h | 20 ++++++++++++++++++++
>  kernel/capability.c            |  5 +++++
>  kernel/user_namespace.c        |  3 +++
>  security/commoncap.c           |  8 ++++++++
>  5 files changed, 37 insertions(+)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 6c0b9677c03f..b8c6cac18658 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>  				 void __user *buff, size_t *lenp, loff_t *ppos);
> +bool is_capability_controlled(int cap);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
>  
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index c18e01252346..e890fe81b47e 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -22,6 +22,7 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
>  };
>  
>  #define USERNS_SETGROUPS_ALLOWED 1UL
> +#define USERNS_CONTROLLED	 2UL
>  
>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>  
> @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
>  		__put_user_ns(ns);
>  }
>  
> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> +	return ns->flags & USERNS_CONTROLLED;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> +	ns->flags |= USERNS_CONTROLLED;
> +}
> +
>  struct seq_operations;
>  extern const struct seq_operations proc_uid_seq_operations;
>  extern const struct seq_operations proc_gid_seq_operations;
> @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
>  {
>  	return ERR_PTR(-EPERM);
>  }
> +
> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> +	return false;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 62dbe3350c1b..40a38cc4ff43 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
>  }
>  
>  /* Controlled-userns capabilities routines */
> +bool is_capability_controlled(int cap)
> +{
> +	return !cap_raised(controlled_userns_caps_whitelist, cap);
> +}
> +
>  #ifdef CONFIG_SYSCTL
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>  				 void __user *buff, size_t *lenp, loff_t *ppos)
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..f393ea5108f0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	cred->cap_effective = CAP_FULL_SET;
>  	cred->cap_ambient = CAP_EMPTY_SET;
>  	cred->cap_bset = CAP_FULL_SET;
> +	if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
> +	    is_user_ns_controlled(user_ns->parent))
> +		mark_user_ns_controlled(user_ns);

Hm, why do this here, rather than at create_user_ns()?  It
shouldn't be recalculated when someone does setns() should it?
Chris Hyser Nov. 9, 2017, 6:01 p.m. UTC | #21
On 11/06/2017 10:23 PM, Serge E. Hallyn wrote:
> I think I definately prefer what I mentioned in the email to Boris.
> Basically a "permanent capability bounding set".  The normal bounding
> set gets reset to a full set on every new user_ns creation.  In this
> proposal, it would instead be set to the calling task's permanent
> capability set, which starts (at boot) full, and which privileged
> tasks can pull capabilities out of.

Actually, this may solve a similar problem I've been looking at. The 
idea was basically at strategic points in the kernel (possibly LSM hook 
sites, still evaluating, and probably syscall entry) validate that a 
task has not "magically" acquired capabilities that it or parent 
specifically said it cannot have and then take some action like say 
killing it immediately. Using your terms, basically make the "permanent 
capability set" a write-once privilege escalation defense. To handle the 
0-day threat, perhaps make it writable but only with more "restrictive" 
values.

-chrish
Serge E. Hallyn Nov. 9, 2017, 6:05 p.m. UTC | #22
Quoting chris hyser (chris.hyser@oracle.com):
> On 11/06/2017 10:23 PM, Serge E. Hallyn wrote:
> >I think I definately prefer what I mentioned in the email to Boris.
> >Basically a "permanent capability bounding set".  The normal bounding
> >set gets reset to a full set on every new user_ns creation.  In this
> >proposal, it would instead be set to the calling task's permanent
> >capability set, which starts (at boot) full, and which privileged
> >tasks can pull capabilities out of.
> 
> Actually, this may solve a similar problem I've been looking at. The
> idea was basically at strategic points in the kernel (possibly LSM
> hook sites, still evaluating, and probably syscall entry) validate
> that a task has not "magically" acquired capabilities that it or
> parent specifically said it cannot have and then take some action
> like say killing it immediately. Using your terms, basically make
> the "permanent capability set" a write-once privilege escalation
> defense. To handle the 0-day threat, perhaps make it writable but
> only with more "restrictive" values.

Would the existing capability bounding set not suffice for that?

The 'permanent' bounding set turns out to not be a good fit for
the problem being discussed in this thread, but please feel free
to start a new thread if you want to discuss your use case.
Chris Hyser Nov. 9, 2017, 6:27 p.m. UTC | #23
On 11/09/2017 01:05 PM, Serge E. Hallyn wrote:
> Would the existing capability bounding set not suffice for that?
> 
> The 'permanent' bounding set turns out to not be a good fit for
> the problem being discussed in this thread, but please feel free
> to start a new thread if you want to discuss your use case.

Sure. I will formulate something for a new thread. What seems to be 
asked for here is a way to globally patch the capability sets of a 
entire process subtree.

-chrish
Eric W. Biederman Nov. 9, 2017, 9:58 p.m. UTC | #24
"Mahesh Bandewar (महेश बंडेवार)" <maheshb@google.com> writes:

> [resend response as earlier one failed because of formatting issues]
>
> On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>
>> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
>> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
>> > <christian.brauner@canonical.com> wrote:
>> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>> > >> Sorry folks I was traveling and seems like lot happened on this thread. :p
>> > >>
>> > >> I will try to response few of these comments selectively -
>> > >>
>> > >> > The thing that makes me hesitate with this set is that it is a
>> > >> > permanent new feature to address what (I hope) is a temporary
>> > >> > problem.
>> > >> I agree this is permanent new feature but it's not solving a temporary
>> > >> problem. It's impossible to assess what and when new vulnerability
>> > >> that could show up. I think Daniel summed it up appropriately in his
>> > >> response
>> > >>
>> > >> > Seems like there are two naive ways to do it, the first being to just
>> > >> > look at all code under ns_capable() plus code called from there.  It
>> > >> > seems like looking at the result of that could be fruitful.
>> > >> This is really hard. The main issue that there were features designed
>> > >> and developed before user-ns days with an assumption that unprivileged
>> > >> users will never get certain capabilities which only root user gets.
>> > >> Now that is not true anymore with user-ns creation with mapping root
>> > >> for any process. Also at the same time blocking user-ns creation for
>> > >> eveyone is a big-hammer which is not needed too. So it's not that easy
>> > >> to just perform a code-walk-though and correct those decisions now.
>> > >>
>> > >> > It seems to me that the existing control in
>> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>> > >> > in that case.
>> > >> This solution is essentially blocking unprivileged users from using
>> > >> the user-namespaces entirely. This is not really a solution that can
>> > >> work. The solution that this patch-set adds allows unprivileged users
>> > >> to create user-namespaces. Actually the proposed solution is more
>> > >> fine-grained approach than the unprivileged_userns_clone solution
>> > >> since you can selectively block capabilities rather than completely
>> > >> blocking the functionality.
>> > >
>> > > I've been talking to Stéphane today about this and we should also keep in mind
>> > > that we have:
>> > >
>> > > chb@conventiont|~
>> > >> ls -al /proc/sys/user/
>> > > total 0
>> > > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
>> > > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>> > >
>> > > These files allow you to limit the number of namespaces that can be created
>> > > *per namespace* type. So let's say your system runs a bunch of user namespaces
>> > > you can do:
>> > >
>> > > chb@conventiont|~
>> > >> echo 0 > /proc/sys/user/max_user_namespaces
>> > >
>> > > So that the next time you try to create a user namespaces you'd see:
>> > >
>> > > chb@conventiont|~
>> > >> unshare -U
>> > > unshare: unshare failed: No space left on device
>> > >
>> > > So there's not even a need to upstream a new sysctl since we have ways of
>> > > blocking this.
>> > >
>> > I'm not sure how it's solving the problem that my patch-set is addressing?
>> > I agree though that the need for unprivileged_userns_clone sysctl goes
>> > away as this is equivalent to setting that sysctl to 0 as you have
>> > described above.
>>
>> oh right that was the reasoning iirc for not needing the other sysctl.
>>
>> > However as I mentioned earlier, blocking processes from creating
>> > user-namespaces is not the solution. Processes should be able to
>> > create namespaces as they are designed but at the same time we need to
>> > have controls to 'contain' them if a need arise. Setting max_no to 0
>> > is not the solution that I'm looking for since it doesn't solve the
>> > problem.
>>
>> well yesterday we were told that was explicitly not the goal, but that was
>> not by you ... i just mention it to explain why we seem to be walking in
>> circles a bit.
>>
>> anyway the bounding set doesn't actually make sense so forget that.   the
>> question then is just whether it makes sense to allow things to continue
>> at all in this situation.  would you mind indulging me by giving one or two
>> concrete examples in the previous known cves of what capabilities you would
>> have dropped tto allow the rest to continue to be safely used?
>>
> Of course. Let's take an example of the CVE that I have mentioned in
> my cover-letter -
> CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308).
> It's well documented and even has a
> exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
> c-program that can demonstrate how it can be used against non-patched
> kernel. There is very nice blog
> post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html)
> about this vulnerability by Andrey Konovalov.
>
> This is about the AF_PACKET socket interface that is protected behind
> NET_RAW capability. This capability is not available to unprivileged
> user. However, any unprivileged user can get NET_RAW capability (as
> demonstrated in the cover-letter code that I have attached in this
> patch series) so this NET_RAW capability is available to any
> unprivileged user on the host if the kernel has user-namespaces
> available.
>
> With this patch-set applied, all that is needed is to flip a bit with
> the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated
> below -
>
> root@lphh6:~# uname -a
> Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
> root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
> kernel.controlled_userns_caps_whitelist = 1f,ffffffff
>
> Now when I run the program (demo from the cover-letter) as a normal
> unprivileged user I can't create a RAW socket in init-ns but I can in
> the child-ns.
>
> dumbo@lphh6:~$ /tmp/acquire_raw
> Attempting to open RAW socket before unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
> Attempting to open RAW socket after unshare()...
> Successfully opened RAW-Sock after unshare().
> dumbo@lphh6:~$
>
> Now as a root user. Take off CAP_NET_RAW
>
> root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
> kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
> root@lphh6:~#
>
> Now run the same program as an unprivileged user -
>
> dumbo@lphh6:~$ /tmp/acquire_raw
> Attempting to open RAW socket before unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
> Attempting to open RAW socket after unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
> dumbo@lphh6:~$
>
> Notice that it has failed to create a raw socket in init and in child
> namespace. It's not blocking creation of user-namespaces but allowing
> admin turn individual capability bits on and off.
>
> This is very simplistic example of just demonstrating how capability
> bits turn-on/off works. So let's assume a sandboxed environment where
> we don't know what a binary that we are about run in an environment
> which is identified as susceptible. By turning off the NET_RAW bit,
> the admin gets an assurance that system is safe and if binary fails
> because it's not getting this capability then that bad but a sad
> consequence (without compromising the host integrity) but if it
> doesn't use the NET_RAW capability but any other combination of
> remaining 36 capabilities, it would get whatever is necessary. This
> means we can safely allow processes to create user-namespaces by
> taking off certain capabilities in question for temporary/extended
> period until proper fix is applied without compromising the system
> integrity. The impact will vary based on which capability is taken off
> and admin would / should be ware of for the environment that he/she is
> dealing with.

My challenge with this reasoning is that I don't know that it meanifully
generalizes to any other capability.

I can in the sandbox today create a user namespace and then set
max_net_namespaces to 0, and drop CAP_NET_RAW and that blocks
the attack.  (Possibly with a little spice to prevent a suid root
program from reacquiring CAP_NET_RAW).

So while your solution doesn't look horrible especially if it can be
done at a user namespace level so the restrictions can be limited to a
single sandbox.  I am not at all certain that the capabilities is the
proper place to limit code reachability.

I would very much like to see which capabilities that are available with
ns_capable, are more meaningful to limit than just dropping the
capability during sandbox creation and denying the creation of the
corresponding namespace.

CAP_NET_RAW is one.  Are there any other capabilities that are
meanginful to limit?

Eric
On Fri, Nov 10, 2017 at 2:25 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Mahesh Bandewar (mahesh@bandewar.net):
>> From: Mahesh Bandewar <maheshb@google.com>
>>
>> With this new notion of "controlled" user-namespaces, the controlled
>> user-namespaces are marked at the time of their creation while the
>> capabilities of processes that belong to them are controlled using the
>> global mask.
>>
>> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
>> that belongs to uncontrolled user-ns can create another (child) user-
>> namespace that is uncontrolled. Any other process (that either does
>> not have SYS_ADMIN or belongs to a controlled user-ns) can only
>> create a user-ns that is controlled.
>>
>> global-capability-whitelist (controlled_userns_caps_whitelist) is used
>> at the capability check-time and keeps the semantics for the processes
>> that belong to uncontrolled user-ns as it is. Processes that belong to
>> controlled user-ns however are subjected to different checks-
>>
>>    (a) if the capability in question is controlled and process belongs
>>        to controlled user-ns, then it's always denied.
>>    (b) if the capability in question is NOT controlled then fall back
>>        to the traditional check.
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>>  include/linux/capability.h     |  1 +
>>  include/linux/user_namespace.h | 20 ++++++++++++++++++++
>>  kernel/capability.c            |  5 +++++
>>  kernel/user_namespace.c        |  3 +++
>>  security/commoncap.c           |  8 ++++++++
>>  5 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/capability.h b/include/linux/capability.h
>> index 6c0b9677c03f..b8c6cac18658 100644
>> --- a/include/linux/capability.h
>> +++ b/include/linux/capability.h
>> @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>                                void __user *buff, size_t *lenp, loff_t *ppos);
>> +bool is_capability_controlled(int cap);
>>
>>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index c18e01252346..e890fe81b47e 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -22,6 +22,7 @@ struct uid_gid_map {        /* 64 bytes -- 1 cache line */
>>  };
>>
>>  #define USERNS_SETGROUPS_ALLOWED 1UL
>> +#define USERNS_CONTROLLED     2UL
>>
>>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>>
>> @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
>>               __put_user_ns(ns);
>>  }
>>
>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> +     return ns->flags & USERNS_CONTROLLED;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> +     ns->flags |= USERNS_CONTROLLED;
>> +}
>> +
>>  struct seq_operations;
>>  extern const struct seq_operations proc_uid_seq_operations;
>>  extern const struct seq_operations proc_gid_seq_operations;
>> @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
>>  {
>>       return ERR_PTR(-EPERM);
>>  }
>> +
>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> +     return false;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> +}
>>  #endif
>>
>>  #endif /* _LINUX_USER_H */
>> diff --git a/kernel/capability.c b/kernel/capability.c
>> index 62dbe3350c1b..40a38cc4ff43 100644
>> --- a/kernel/capability.c
>> +++ b/kernel/capability.c
>> @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
>>  }
>>
>>  /* Controlled-userns capabilities routines */
>> +bool is_capability_controlled(int cap)
>> +{
>> +     return !cap_raised(controlled_userns_caps_whitelist, cap);
>> +}
>> +
>>  #ifdef CONFIG_SYSCTL
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>                                void __user *buff, size_t *lenp, loff_t *ppos)
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index c490f1e4313b..f393ea5108f0 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>>       cred->cap_effective = CAP_FULL_SET;
>>       cred->cap_ambient = CAP_EMPTY_SET;
>>       cred->cap_bset = CAP_FULL_SET;
>> +     if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
>> +         is_user_ns_controlled(user_ns->parent))
>> +             mark_user_ns_controlled(user_ns);
>
> Hm, why do this here, rather than at create_user_ns()? It
> shouldn't be recalculated when someone does setns() should it?
>
You are absolutely right! It doesn't make sense to recalculate for
every setns() call. It's a side effect of couple of iterations /
approaches that I tried before finalizing this one. I'll move this
block to create_user_ns() after the set_cred_user_ns() call so that
this wont be triggered in setns() path.

Thanks,
--mahesh..
On Fri, Nov 10, 2017 at 6:58 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> "Mahesh Bandewar (महेश बंडेवार)" <maheshb@google.com> writes:
>
>> [resend response as earlier one failed because of formatting issues]
>>
>> On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>>
>>> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
>>> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
>>> > <christian.brauner@canonical.com> wrote:
>>> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>>> > >> Sorry folks I was traveling and seems like lot happened on this thread. :p
>>> > >>
>>> > >> I will try to response few of these comments selectively -
>>> > >>
>>> > >> > The thing that makes me hesitate with this set is that it is a
>>> > >> > permanent new feature to address what (I hope) is a temporary
>>> > >> > problem.
>>> > >> I agree this is permanent new feature but it's not solving a temporary
>>> > >> problem. It's impossible to assess what and when new vulnerability
>>> > >> that could show up. I think Daniel summed it up appropriately in his
>>> > >> response
>>> > >>
>>> > >> > Seems like there are two naive ways to do it, the first being to just
>>> > >> > look at all code under ns_capable() plus code called from there.  It
>>> > >> > seems like looking at the result of that could be fruitful.
>>> > >> This is really hard. The main issue that there were features designed
>>> > >> and developed before user-ns days with an assumption that unprivileged
>>> > >> users will never get certain capabilities which only root user gets.
>>> > >> Now that is not true anymore with user-ns creation with mapping root
>>> > >> for any process. Also at the same time blocking user-ns creation for
>>> > >> eveyone is a big-hammer which is not needed too. So it's not that easy
>>> > >> to just perform a code-walk-though and correct those decisions now.
>>> > >>
>>> > >> > It seems to me that the existing control in
>>> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>>> > >> > in that case.
>>> > >> This solution is essentially blocking unprivileged users from using
>>> > >> the user-namespaces entirely. This is not really a solution that can
>>> > >> work. The solution that this patch-set adds allows unprivileged users
>>> > >> to create user-namespaces. Actually the proposed solution is more
>>> > >> fine-grained approach than the unprivileged_userns_clone solution
>>> > >> since you can selectively block capabilities rather than completely
>>> > >> blocking the functionality.
>>> > >
>>> > > I've been talking to Stéphane today about this and we should also keep in mind
>>> > > that we have:
>>> > >
>>> > > chb@conventiont|~
>>> > >> ls -al /proc/sys/user/
>>> > > total 0
>>> > > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
>>> > > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>>> > >
>>> > > These files allow you to limit the number of namespaces that can be created
>>> > > *per namespace* type. So let's say your system runs a bunch of user namespaces
>>> > > you can do:
>>> > >
>>> > > chb@conventiont|~
>>> > >> echo 0 > /proc/sys/user/max_user_namespaces
>>> > >
>>> > > So that the next time you try to create a user namespaces you'd see:
>>> > >
>>> > > chb@conventiont|~
>>> > >> unshare -U
>>> > > unshare: unshare failed: No space left on device
>>> > >
>>> > > So there's not even a need to upstream a new sysctl since we have ways of
>>> > > blocking this.
>>> > >
>>> > I'm not sure how it's solving the problem that my patch-set is addressing?
>>> > I agree though that the need for unprivileged_userns_clone sysctl goes
>>> > away as this is equivalent to setting that sysctl to 0 as you have
>>> > described above.
>>>
>>> oh right that was the reasoning iirc for not needing the other sysctl.
>>>
>>> > However as I mentioned earlier, blocking processes from creating
>>> > user-namespaces is not the solution. Processes should be able to
>>> > create namespaces as they are designed but at the same time we need to
>>> > have controls to 'contain' them if a need arise. Setting max_no to 0
>>> > is not the solution that I'm looking for since it doesn't solve the
>>> > problem.
>>>
>>> well yesterday we were told that was explicitly not the goal, but that was
>>> not by you ... i just mention it to explain why we seem to be walking in
>>> circles a bit.
>>>
>>> anyway the bounding set doesn't actually make sense so forget that.   the
>>> question then is just whether it makes sense to allow things to continue
>>> at all in this situation.  would you mind indulging me by giving one or two
>>> concrete examples in the previous known cves of what capabilities you would
>>> have dropped tto allow the rest to continue to be safely used?
>>>
>> Of course. Let's take an example of the CVE that I have mentioned in
>> my cover-letter -
>> CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308).
>> It's well documented and even has a
>> exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
>> c-program that can demonstrate how it can be used against non-patched
>> kernel. There is very nice blog
>> post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html)
>> about this vulnerability by Andrey Konovalov.
>>
>> This is about the AF_PACKET socket interface that is protected behind
>> NET_RAW capability. This capability is not available to unprivileged
>> user. However, any unprivileged user can get NET_RAW capability (as
>> demonstrated in the cover-letter code that I have attached in this
>> patch series) so this NET_RAW capability is available to any
>> unprivileged user on the host if the kernel has user-namespaces
>> available.
>>
>> With this patch-set applied, all that is needed is to flip a bit with
>> the sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated
>> below -
>>
>> root@lphh6:~# uname -a
>> Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
>> root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
>> kernel.controlled_userns_caps_whitelist = 1f,ffffffff
>>
>> Now when I run the program (demo from the cover-letter) as a normal
>> unprivileged user I can't create a RAW socket in init-ns but I can in
>> the child-ns.
>>
>> dumbo@lphh6:~$ /tmp/acquire_raw
>> Attempting to open RAW socket before unshare()...
>> socket() SOCK_RAW failed: : Operation not permitted
>> Attempting to open RAW socket after unshare()...
>> Successfully opened RAW-Sock after unshare().
>> dumbo@lphh6:~$
>>
>> Now as a root user. Take off CAP_NET_RAW
>>
>> root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
>> kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
>> root@lphh6:~#
>>
>> Now run the same program as an unprivileged user -
>>
>> dumbo@lphh6:~$ /tmp/acquire_raw
>> Attempting to open RAW socket before unshare()...
>> socket() SOCK_RAW failed: : Operation not permitted
>> Attempting to open RAW socket after unshare()...
>> socket() SOCK_RAW failed: : Operation not permitted
>> dumbo@lphh6:~$
>>
>> Notice that it has failed to create a raw socket in init and in child
>> namespace. It's not blocking creation of user-namespaces but allowing
>> admin turn individual capability bits on and off.
>>
>> This is very simplistic example of just demonstrating how capability
>> bits turn-on/off works. So let's assume a sandboxed environment where
>> we don't know what a binary that we are about run in an environment
>> which is identified as susceptible. By turning off the NET_RAW bit,
>> the admin gets an assurance that system is safe and if binary fails
>> because it's not getting this capability then that bad but a sad
>> consequence (without compromising the host integrity) but if it
>> doesn't use the NET_RAW capability but any other combination of
>> remaining 36 capabilities, it would get whatever is necessary. This
>> means we can safely allow processes to create user-namespaces by
>> taking off certain capabilities in question for temporary/extended
>> period until proper fix is applied without compromising the system
>> integrity. The impact will vary based on which capability is taken off
>> and admin would / should be ware of for the environment that he/she is
>> dealing with.
>
> My challenge with this reasoning is that I don't know that it meanifully
> generalizes to any other capability.
>
> I can in the sandbox today create a user namespace and then set
> max_net_namespaces to 0, and drop CAP_NET_RAW and that blocks
> the attack.  (Possibly with a little spice to prevent a suid root
> program from reacquiring CAP_NET_RAW).
>
This is problematic since you are expecting the user-namespace creator
to perform this operation and then block the child process from
creating the user-namespace. This is similar to making user-namespace
creation a privileged operation discussed previously.

> So while your solution doesn't look horrible especially if it can be
> done at a user namespace level so the restrictions can be limited to a
> single sandbox.  I am not at all certain that the capabilities is the
> proper place to limit code reachability.
>
> I would very much like to see which capabilities that are available with
> ns_capable, are more meaningful to limit than just dropping the
> capability during sandbox creation and denying the creation of the
> corresponding namespace.
>
The primary assumption in this approach is that we can drop
capabilities before running the workload and then not allowing
workload to create the user-namespace. This does not work for cases
where workload needs to create user-namespaces.

> CAP_NET_RAW is one.  Are there any other capabilities that are
> meanginful to limit?
>
There are currently 37 capabilities and I see many of those are
currently namespace aware (with ns_capable() call). Also there seems
to be disproportionate amount of capable() to ns_capable() calls. This
could be a result of not every feature available kernel-wide being
namespace aware/capable etc. and this will evolve and mature i.e.
ns_capable() will continue to grow where this would be applicable.
Also Probably I'm the wrong person to ask this question to since I
understand networking more than anything else. However, the main point
is that we cannot predict which vulnerability is going to get
published tomorrow networking or non-networking, so having a tool that
gives controls to admin while allowing user-namespace creation is
super useful.

thanks,
--mahesh..

> Eric
Serge E. Hallyn Nov. 10, 2017, 4:46 a.m. UTC | #27
Quoting Eric W. Biederman (ebiederm@xmission.com):
> single sandbox.  I am not at all certain that the capabilities is the
> proper place to limit code reachability.

Right, I keep having this gut feeling that there is another way we
should be doing that.  Maybe based on ksplice or perf, or maybe more
based on subsystems.  And I hope someone pursues that.  But I can't put
my finger on it, and meanwhile the capability checks obviously *are* in
fact gates...

-serge
On Fri, Nov 10, 2017 at 1:46 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> single sandbox.  I am not at all certain that the capabilities is the
>> proper place to limit code reachability.
>
> Right, I keep having this gut feeling that there is another way we
> should be doing that.  Maybe based on ksplice or perf, or maybe more
> based on subsystems.  And I hope someone pursues that.  But I can't put
> my finger on it, and meanwhile the capability checks obviously *are* in
> fact gates...
>
Well, I don't mind if there is a better solution available. The
proposed solution is not adding too much or complex code and using a
bit and a sysctl and will be sitting dormant. When we have complete
solution, this addition should not be a burden to maintain because of
it's non-invasive footprint.

I will push the next version of the patch-set that implements Serge's finding.

Thanks,
--mahesh..

[PS: I'll be soon traveling again and moving to an area where
connectivity will be scarce / unreliable. So please expect lot more
delays in my responses.]

> -serge
diff mbox

Patch

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6c0b9677c03f..b8c6cac18658 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -250,6 +250,7 @@  extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 				 void __user *buff, size_t *lenp, loff_t *ppos);
+bool is_capability_controlled(int cap);
 
 extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
 
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..e890fe81b47e 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -22,6 +22,7 @@  struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 };
 
 #define USERNS_SETGROUPS_ALLOWED 1UL
+#define USERNS_CONTROLLED	 2UL
 
 #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
 
@@ -102,6 +103,16 @@  static inline void put_user_ns(struct user_namespace *ns)
 		__put_user_ns(ns);
 }
 
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+	return ns->flags & USERNS_CONTROLLED;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+	ns->flags |= USERNS_CONTROLLED;
+}
+
 struct seq_operations;
 extern const struct seq_operations proc_uid_seq_operations;
 extern const struct seq_operations proc_gid_seq_operations;
@@ -160,6 +171,15 @@  static inline struct ns_common *ns_get_owner(struct ns_common *ns)
 {
 	return ERR_PTR(-EPERM);
 }
+
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+	return false;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+}
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/capability.c b/kernel/capability.c
index 62dbe3350c1b..40a38cc4ff43 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -510,6 +510,11 @@  bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
 }
 
 /* Controlled-userns capabilities routines */
+bool is_capability_controlled(int cap)
+{
+	return !cap_raised(controlled_userns_caps_whitelist, cap);
+}
+
 #ifdef CONFIG_SYSCTL
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 				 void __user *buff, size_t *lenp, loff_t *ppos)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..f393ea5108f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -53,6 +53,9 @@  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	cred->cap_effective = CAP_FULL_SET;
 	cred->cap_ambient = CAP_EMPTY_SET;
 	cred->cap_bset = CAP_FULL_SET;
+	if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
+	    is_user_ns_controlled(user_ns->parent))
+		mark_user_ns_controlled(user_ns);
 #ifdef CONFIG_KEYS
 	key_put(cred->request_key_auth);
 	cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index fc46f5b85251..89103f16ac37 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -73,6 +73,14 @@  int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
 {
 	struct user_namespace *ns = targ_ns;
 
+	/* If the capability is controlled and user-ns that process
+	 * belongs-to is 'controlled' then return EPERM and no need
+	 * to check the user-ns hierarchy.
+	 */
+	if (is_user_ns_controlled(cred->user_ns) &&
+	    is_capability_controlled(cap))
+		return -EPERM;
+
 	/* See if cred has the capability in the target user namespace
 	 * by examining the target user namespace and all of the target
 	 * user namespace's parents.