SELinux leads to soft lockup when pid 1 process reaps child

Message ID 58736B2E.90201@huawei.com (mailing list archive)
State Rejected
Commit Message

yangshukui Jan. 9, 2017, 10:51 a.m. UTC
On the host, the pid 1 process (running as init_t) has the right to reap
children, but in a container the pid 1 process (for example one running as
spc_t, which docker uses as the container's default type) may not have that
right. When that happens, it leads to a soft lockup. The following steps
reproduce it:

docker run -ti --rm -v /sys/fs/selinux:/sys/fs/selinux fedora:20 bash
[root@b755018fb526 /]# yum install selinux-policy-targeted selinux-policy-devel perl-Test-Harness gcc libselinux-devel net-tools netlabel_tools iptables git cpan
[root@b755018fb526 /]# git clone https://github.com/SELinuxProject/selinux-testsuite.git
[root@b755018fb526 /]# setenforce 0
[root@b755018fb526 /]# runcon -t unconfined_t bash
[root@b755018fb526 /]# genhomedircon
[root@b755018fb526 /]# restorecon -R /
[root@b755018fb526 /]# setenforce 1
[root@b755018fb526 /]# cd /root/selinux-testsuite/
[root@b755018fb526 selinux-testsuite]# make -C policy load
[root@b755018fb526 selinux-testsuite]# make -C tests test
[root@b755018fb526 selinux-testsuite]# exit  #this will lead to soft lockup

Before exiting the container, we can also see some zombies:
[root@b755018fb526 selinux-testsuite]# ps -eafZ
LABEL                           UID        PID  PPID  C STIME TTY          TIME CMD
...
unconfined_u:unconfined_r:test_fdreceive_server_t:s0 root 215 1  0 05:35 pts/0 00:00:00 [server] <defunct>
unconfined_u:unconfined_r:test_ptrace_traced_t:s0 root 291 1  0 05:35 pts/0 00:00:00 [wait] <defunct>
unconfined_u:unconfined_r:test_setnice_set_t:s0 root 374 1  0 05:35 pts/0 00:00:00 [child] <defunct>

In the kernel code:
zap_pid_ns_processes {
       ...
       /* Firstly reap the EXIT_ZOMBIE children we may have. */
       do {
           clear_thread_flag(TIF_SIGPENDING);
           rc = sys_wait4(-1, NULL, __WALL, NULL);
           /*
            * sys_wait4 -> do_wait -> wait_consider_task ->
            * security_task_wait -> selinux_task_wait ->
            * avc_has_perm_flags -> avc_has_perm_noaudit -> avc_denied
            *
            * The return value is -EACCES, so the loop never sees the
            * expected -ECHILD and spins forever.
            */
       } while (rc != -ECHILD);
}
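The failure mode above can be sketched in userspace. This is a minimal model, not kernel code: struct model_ns, model_wait4() and model_zap() are hypothetical names, and the iteration cap exists only so that the model terminates where the real loop would not.

```c
#include <errno.h>

/* Hypothetical userspace model of the pid namespace being torn down. */
struct model_ns {
	int reapable;	/* zombies the caller is allowed to reap */
	int denied;	/* zombies for which the SIGCHLD check is denied */
};

/* Stand-in for sys_wait4(-1, NULL, __WALL, NULL) as seen by the loop. */
static int model_wait4(struct model_ns *ns)
{
	if (ns->reapable > 0) {
		ns->reapable--;
		return 1;		/* reaped one child (a pid in the real call) */
	}
	if (ns->denied > 0)
		return -EACCES;		/* children exist, but none may be reaped */
	return -ECHILD;			/* no children left */
}

/*
 * Run the do/while from zap_pid_ns_processes() for at most max_iter
 * iterations. Returns the number of iterations on termination, or -1
 * if the cap was hit, i.e. the real loop would spin forever.
 */
static int model_zap(struct model_ns *ns, int max_iter)
{
	int rc, n = 0;

	do {
		if (n == max_iter)
			return -1;
		rc = model_wait4(ns);
		n++;
	} while (rc != -ECHILD);

	return n;
}
```

With three reapable zombies and none denied, model_zap() finishes on the fourth call; with a single denied zombie it never sees -ECHILD and hits the cap.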

I have a hack like the patch below. It works, but it permits the pid 1
process to reap children without any SELinux check.
Is there a better way to handle this problem?

Comments

yangshukui March 9, 2017, 9:03 a.m. UTC | #1
I want to use SELinux in a system container, and I only care about how it
functions inside the container.
The system container runs in a VM, and every VM has only one system container.

How do I use it now?
docker run ... system-container /sbin/init
After init is running, the following service also runs:

# This is the part of the service file that runs in the container after
# the container starts.
...
semodule -R     # use the policy in the container
restorecon /     # if needed
...

This method seems to work if the host OS and the docker image use the same
content for the rootfs, but if the host uses
redhat7 and the docker image uses centos7, it denies many normal
operations, and this stops some host services from working.

If SELinux were permissive on the host and enforcing in the container, that
would resolve my problem. Unfortunately,
there is no namespace for SELinux.

Isolating SELinux is difficult and takes a lot of work, but it is
easier to isolate selinux_enforcing.

What do you think?

Thank you very much.
Stephen Smalley March 9, 2017, 3:28 p.m. UTC | #2
On Thu, 2017-03-09 at 17:03 +0800, yangshukui wrote:
> I want to use SELinux in system container and only concern the
> function 
> in the container.
> this system container run in vm and every vm has only one system
> container.
> 
> How do I use now?
> docker run ... system-contaier /sbin/init
> after init is running ,the following service is also running:
> 
> #this is the part of service file which will run in container after 
> starting the container.
> ...
> semodule -R     #use the policy in container.
> restorecon /     #if needed
> ...
> 
> this method seem to work if host os and the docker images use the
> same 
> content for rootfs, but if host use
> redhat7 and docker images use centos7, it will deny many normal 
> operations , and this let some host service not work.
> 
> If SELinux is permissive in host and enforcing in container ,it will 
> resolve my problem. Unfortunately,
> there is no namespace for SELinux.
> 
> Isolate SELinux is difficult and it has a lot of work to do, but is 
> easier to isolate selinux_enforcing.
> 
> What do you think ?

I'd rather see proper SELinux policy namespace support implemented.
Admittedly, that won't be straightforward.

FWIW, ChromiumOS appears to have done something similar to what you
suggest for supporting Android containers (i.e. SELinux enforcing for
the Android container, permissive for ChromiumOS processes outside the
container), but they never discussed it with upstream SELinux
developers AFAIK.  My only knowledge of what they have done comes from
their kernel repository [1]. It appears that they experimented with a
hack to narrow the scope of selinux_enforcing to a PID namespace [2],
then reverted that change later and just implemented an option to
suppress audit denials for permissive domains [3] (evidently they are
running the Chromium OS processes in a permissive domain; I haven't
seen their policy).  I wouldn't recommend either approach; the former
won't properly handle permission checks that occur outside of process
context or certain permission checks where the source context is not
the current task context (e.g. an inter-object relationship check),
while the latter requires leaving a permissive domain in the production
policy (which seemingly would violate CTS; not sure why that gets a
pass, and if that is ok, then why didn't they just create a domain
allowed all permissions and use that outside the container instead -
then they won't need to suppress audit at all?) and further requires
use of a separate kernel for policy development/debugging.  Note btw
that they could have silenced the permissive denials via dontaudit
rules instead (as Android does for its su domain) but chose not to do
so to avoid taking the slow path.

[1] https://chromium.googlesource.com/chromiumos/third_party/kernel
[2] https://chromium-review.googlesource.com/c/361464/
[3] https://chromium-review.googlesource.com/c/424948/
Stephen Smalley March 9, 2017, 3:39 p.m. UTC | #3
On Thu, 2017-03-09 at 10:28 -0500, Stephen Smalley wrote:
> On Thu, 2017-03-09 at 17:03 +0800, yangshukui wrote:
> > 
> > I want to use SELinux in system container and only concern the
> > function 
> > in the container.
> > this system container run in vm and every vm has only one system
> > container.
> > 
> > How do I use now?
> > docker run ... system-contaier /sbin/init
> > after init is running ,the following service is also running:
> > 
> > #this is the part of service file which will run in container
> > after 
> > starting the container.
> > ...
> > semodule -R     #use the policy in container.
> > restorecon /     #if needed
> > ...
> > 
> > this method seem to work if host os and the docker images use the
> > same 
> > content for rootfs, but if host use
> > redhat7 and docker images use centos7, it will deny many normal 
> > operations , and this let some host service not work.
> > 
> > If SELinux is permissive in host and enforcing in container ,it
> > will 
> > resolve my problem. Unfortunately,
> > there is no namespace for SELinux.
> > 
> > Isolate SELinux is difficult and it has a lot of work to do, but
> > is 
> > easier to isolate selinux_enforcing.
> > 
> > What do you think ?
> 
> I'd rather see proper SELinux policy namespace support implemented.
> Admittedly, that won't be straightforward.
> 
> FWIW, ChromiumOS appears to have done something similar to what you
> suggest for supporting Android containers (i.e. SELinux enforcing for
> the Android container, permissive for ChromiumOS processes outside
> the
> container), but they never discussed it with upstream SELinux
> developers AFAIK.  My only knowledge of what they have done comes
> from
> their kernel repository [1]. It appears that they experimented with a
> hack to narrow the scope of selinux_enforcing to a PID namespace [2],
> then reverted that change later and just implemented an option to
> suppress audit denials for permissive domains [3] (evidently they are
> running the Chromium OS processes in a permissive domain; I haven't
> seen their policy).  I wouldn't recommend either approach; the former
> won't properly handle permission checks that occur outside of process
> context or certain permission checks where the source context is not
> the current task context (e.g. an inter-object relationship check),
> while the latter requires leaving a permissive domain in the
> production
> policy (which seemingly would violate CTS; not sure why that gets a
> pass, and if that is ok, then why didn't they just create a domain
> allowed all permissions and use that outside the container instead -
> then they won't need to suppress audit at all?) and further requires
> use of a separate kernel for policy development/debugging.  Note btw
> that they could have silenced the permissive denials via dontaudit
> rules instead (as Android does for its su domain) but chose not to do
> so to avoid taking the slow path.

Sorry, should have looked more closely at their actual change - that
last part of their rationale is bogus; a dontaudit rule would have
prevented calling slow_avc_audit() at all, whereas their change merely
returns early from slow_avc_audit().  So I really don't understand why
they didn't just define dontaudit rules for all permissions (if using a
permissive domain) or allow rules for all permissions (if using an
enforcing, allow-all domain).  Neither one is especially hard to write,
and they could have just looked at the su domain in Android for an
example of the former.
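The distinction between the two paths can be modelled in userspace. This is a simplified sketch, assuming only the general shape of the audit decision (denied permissions masked by auditdeny, with the slow path entered only if something remains to audit). Every model_* name and the suppress_permissive flag are hypothetical stand-ins, not the ChromiumOS change or the kernel API.

```c
/* Hypothetical, simplified access vector decision. */
struct model_avd {
	unsigned allowed;	/* permissions granted by allow rules */
	unsigned auditdeny;	/* denied permissions that should be audited;
				 * a dontaudit rule clears bits here */
};

struct model_stats {
	int slow_calls;		/* times the slow path was entered */
	int records;		/* audit records actually emitted */
};

/* Stand-in for slow_avc_audit(), with the early return described above. */
static void model_slow_avc_audit(int permissive, int suppress_permissive,
				 struct model_stats *st)
{
	st->slow_calls++;
	if (permissive && suppress_permissive)
		return;		/* early return: cost of the call already paid */
	st->records++;
}

/* Stand-in for the inline fast path that decides whether to audit. */
static void model_avc_audit(unsigned requested, const struct model_avd *avd,
			    int permissive, int suppress_permissive,
			    struct model_stats *st)
{
	unsigned denied = requested & ~avd->allowed;
	unsigned audited = denied & avd->auditdeny;

	if (!audited)
		return;		/* dontaudit case: slow path never called */
	model_slow_avc_audit(permissive, suppress_permissive, st);
}
```

In this model a dontaudit-everything decision (auditdeny == 0) leaves slow_calls at zero, while the suppression flag still counts one slow-path call per denial and only drops the record.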

> 
> [1] https://chromium.googlesource.com/chromiumos/third_party/kernel
> [2] https://chromium-review.googlesource.com/c/361464/
> [3] https://chromium-review.googlesource.com/c/424948/
Casey Schaufler March 9, 2017, 4:39 p.m. UTC | #4
On 3/9/2017 1:03 AM, yangshukui wrote:
> I want to use SELinux in system container and only concern the function in the container.
> this system container run in vm and every vm has only one system container.
>
> How do I use now?
> docker run ... system-contaier /sbin/init
> after init is running ,the following service is also running:
>
> #this is the part of service file which will run in container after starting the container.
> ..
> semodule -R     #use the policy in container.
> restorecon /     #if needed
> ..
>
> this method seem to work if host os and the docker images use the same content for rootfs, but if host use
> redhat7 and docker images use centos7, it will deny many normal operations , and this let some host service not work.
>
> If SELinux is permissive in host and enforcing in container ,it will resolve my problem. Unfortunately,
> there is no namespace for SELinux.

The LSM infrastructure is essentially a set of lists.
These lists are rooted globally, but there's no reason*
they couldn't be rooted in a namespace. That would give
each namespace the option of using whatever security
scheme was deemed appropriate. There are a number of
issues, such as namespacing policy, that would have to
be addressed, but the mechanism could work fine. I would
look at patches.

---
* Other than the sheer insanity of making security
  claims about such a system. I would not expect that
  minor issue to slow demand or deployment any more
  than it has in the past.
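A rough illustration of hook lists rooted per namespace, as opposed to globally. This is a userspace sketch under stated assumptions: struct model_hook, struct model_lsm_ns and model_call_task_wait() are hypothetical names, and the walk simply stops at the first hook that returns an error.

```c
#include <errno.h>
#include <stddef.h>

/* One registered hook implementation in a singly linked list. */
struct model_hook {
	int (*task_wait)(int subject, int target);
	struct model_hook *next;
};

/* Hypothetical namespace object: the list head lives here, not globally. */
struct model_lsm_ns {
	struct model_hook *task_wait_hooks;
};

/* Walk the hooks of one namespace; the first non-zero result wins. */
static int model_call_task_wait(const struct model_lsm_ns *ns,
				int subject, int target)
{
	const struct model_hook *h;
	int rc;

	for (h = ns->task_wait_hooks; h != NULL; h = h->next) {
		rc = h->task_wait(subject, target);
		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Two trivial example hooks for the demonstration. */
static int model_allow(int subject, int target)
{
	(void)subject;
	(void)target;
	return 0;
}

static int model_deny(int subject, int target)
{
	(void)subject;
	(void)target;
	return -EACCES;
}
```

Two namespaces with different lists then give different answers for the same subject and target, which is the mechanism; namespacing the policy itself is the part this sketch leaves out.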

>
> Isolate SELinux is difficult and it has a lot of work to do, but is easier to isolate selinux_enforcing.
>
> What do you think ?
>
> Think you very much.
Eric W. Biederman March 9, 2017, 8:49 p.m. UTC | #5
Casey Schaufler <casey@schaufler-ca.com> writes:

> On 3/9/2017 1:03 AM, yangshukui wrote:
>> I want to use SELinux in system container and only concern the function in the container.
>> this system container run in vm and every vm has only one system container.
>>
>> How do I use now?
>> docker run ... system-contaier /sbin/init
>> after init is running ,the following service is also running:
>>
>> #this is the part of service file which will run in container after starting the container.
>> ..
>> semodule -R     #use the policy in container.
>> restorecon /     #if needed
>> ..
>>
>> this method seem to work if host os and the docker images use the same content for rootfs, but if host use
>> redhat7 and docker images use centos7, it will deny many normal operations , and this let some host service not work.
>>
>> If SELinux is permissive in host and enforcing in container ,it will resolve my problem. Unfortunately,
>> there is no namespace for SELinux.

This is mostly a SELinux problem.

> The LSM infrastructure is essentially a set of lists.
> These lists are rooted globally, but there's no reason*
> they couldn't be rooted in a namespace. That would give
> each namespace the option of using whatever security
> scheme was deemed appropriate. There are a number of
> issues, such as namespacing policy, that would have to
> be addressed, but the mechanism could work fine. I would
> look at patches.

>
> ---
> * Other than the sheer insanity of making security
>   claims about such a system. I would not expect that
>   minor issue to slow demand or deployment any more
>   than it has in the past.

I would tend to insist that the container local policy stacks inside the
global policy.  So that at the least the global security claims would
not be reduced.

My expectation is that a container would run as essentially all one
label from a global perspective.

To implement this would require a revision on the selinux labels xattrs
so that they can be marked as being part of a container...  But having
the labels look ordinary inside the container.

We almost have a patch that implements something like that for the
capability xattr.
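The stacking rule can be sketched as a conjunction: the container-local policy can only remove access, never add it. This is a minimal illustration, assuming policies can be reduced to a yes/no callback. model_policy_fn and model_stacked_allowed() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy decision callback: may source use perm on target? */
typedef bool (*model_policy_fn)(int source, int target, unsigned perm);

/*
 * Stacked check: the global policy is always consulted first, and the
 * container-local policy (if any) can only deny further.
 */
static bool model_stacked_allowed(model_policy_fn global,
				  model_policy_fn local,
				  int source, int target, unsigned perm)
{
	if (!global(source, target, perm))
		return false;	/* global security claims are never weakened */
	if (local == NULL)
		return true;	/* task is not in a container with its own policy */
	return local(source, target, perm);
}

/* Trivial example policies for the demonstration. */
static bool model_allow_all(int source, int target, unsigned perm)
{
	(void)source;
	(void)target;
	(void)perm;
	return true;
}

static bool model_deny_all(int source, int target, unsigned perm)
{
	(void)source;
	(void)target;
	(void)perm;
	return false;
}
```

A permissive-looking local policy under a denying global policy still yields a denial, which is the property being insisted on.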

Eric
Paul Moore March 10, 2017, 12:05 a.m. UTC | #6
On Thu, Mar 9, 2017 at 3:49 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
>
>> On 3/9/2017 1:03 AM, yangshukui wrote:
>>> I want to use SELinux in system container and only concern the function in the container.
>>> this system container run in vm and every vm has only one system container.
>>>
>>> How do I use now?
>>> docker run ... system-contaier /sbin/init
>>> after init is running ,the following service is also running:
>>>
>>> #this is the part of service file which will run in container after starting the container.
>>> ..
>>> semodule -R     #use the policy in container.
>>> restorecon /     #if needed
>>> ..
>>>
>>> this method seem to work if host os and the docker images use the same content for rootfs, but if host use
>>> redhat7 and docker images use centos7, it will deny many normal operations , and this let some host service not work.
>>>
>>> If SELinux is permissive in host and enforcing in container ,it will resolve my problem. Unfortunately,
>>> there is no namespace for SELinux.
>
> This is mostly a SELinux problem.
>
>> The LSM infrastructure is essentially a set of lists.
>> These lists are rooted globally, but there's no reason*
>> they couldn't be rooted in a namespace. That would give
>> each namespace the option of using whatever security
>> scheme was deemed appropriate. There are a number of
>> issues, such as namespacing policy, that would have to
>> be addressed, but the mechanism could work fine. I would
>> look at patches.
>
>>
>> ---
>> * Other than the sheer insanity of making security
>>   claims about such a system. I would not expect that
>>   minor issue to slow demand or deployment any more
>>   than it has in the past.
>
> I would tend to insist that the container local policy stacks inside the
> global policy.  So that at the least the global security claims would
> not be reduced.

My current thinking is that namespacing is best left to the individual
LSMs, as it is unlikely we will all want to solve it the same way.
With SELinux we already have some basic support for what Eric
describes via bounded domains, but that alone isn't likely to solve
SELinux inside containers in a sense that most would expect; for that
you will need what Stephen already described.
James Morris March 13, 2017, 7:06 a.m. UTC | #7
On Thu, 9 Mar 2017, Eric W. Biederman wrote:

> My expectation is that a container would run as essentially all one
> label from a global perspective.
> 

Keep in mind that different classes of objects may have distinct 
labeling in SELinux.  e.g. a process and a file typically have different 
labels (say, sshd_t vs. sshd_key_t).

Also, I think you will want to have the global namespace always use the 
original security labels.  If accessing an object from outside the 
container, the original global policy should always apply.  Really, this 
needs to be an invariant property.

I'd suggest implementing an orthogonal 2nd set of security labels which 
are only ever used within the container.


> To implement this would require a revision on the selinux labels xattrs
> so that they can be marked as being part of a container...  But having
> the labels look ordinary inside the container.
> 
> We almost have a patch that implements something like that for the
> capability xattr.

It'll be interesting to see.
Casey Schaufler March 13, 2017, 4:05 p.m. UTC | #8
On 3/13/2017 12:06 AM, James Morris wrote:
> On Thu, 9 Mar 2017, Eric W. Biederman wrote:
>
>> My expectation is that a container would run as essentially all one
>> label from a global perspective.
>>
> Keep in mind that a different classes of objects may have distinct 
> labeling in SELinux.  e.g. a process and a file typically have different 
> labels (say, sshd_t vs. sshd_key_t).
>
> Also, I think you will want to have the global namespace always use the 
> original security labels.  If accessing an object from outside the 
> container, the original global policy should always apply.  Really, this 
> needs to be an invariant property.
>
> I'd suggest implementing an orthogonal 2nd set of security labels which 
> are only ever used within the container.

The work that's been done for Smack namespaces

	https://lwn.net/Articles/652320

may come in handy during your deliberations for
SELinux. Conceptually you can create aliases for your
base labels, and use those within the container. Very
much like the UID mapping of user namespaces. Labels that
don't have an alias can't be accessed within the namespace.
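The aliasing idea can be sketched as a lookup table, by analogy with a uid map. This is a userspace illustration only: struct model_label_map and both lookup helpers are hypothetical, the label strings are made up, and nothing here reflects the actual Smack namespace patches.

```c
#include <stddef.h>
#include <string.h>

/* One mapping entry: the alias seen inside the namespace and the base
 * label it stands for in the global namespace. */
struct model_label_map {
	const char *alias;
	const char *base;
};

/* Translate a label used inside the namespace to its base label.
 * Returns NULL if the alias is not mapped. */
static const char *model_alias_to_base(const struct model_label_map *map,
				       size_t n, const char *alias)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(map[i].alias, alias) == 0)
			return map[i].base;
	return NULL;
}

/* Translate a base label to the alias visible inside the namespace.
 * Returns NULL for unmapped labels, which the namespace cannot access. */
static const char *model_base_to_alias(const struct model_label_map *map,
				       size_t n, const char *base)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(map[i].base, base) == 0)
			return map[i].alias;
	return NULL;
}
```

A base label with no entry in the map has no name inside the namespace, so an object carrying it cannot be referred to from there.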

>> To implement this would require a revision on the selinux labels xattrs
>> so that they can be marked as being part of a container...  But having
>> the labels look ordinary inside the container.
>>
>> We almost have a patch that implements something like that for the
>> capability xattr.
> It'll be interesting to see.
>
Patch

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 57a2020..c10c58c 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3596,6 +3596,9 @@ static int selinux_task_kill(struct task_struct *p, struct siginfo *info,

 static int selinux_task_wait(struct task_struct *p)
 {
+	if (pid_vnr(task_tgid(current)) == 1) {
+		return 0;
+	}
 	return task_has_perm(p, current, PROCESS__SIGCHLD);
 }