[0/3] Allow initializing the kernfs node's secctx based on its parent

Message ID 20190109091028.24485-1-omosnace@redhat.com (mailing list archive)

Message

Ondrej Mosnacek Jan. 9, 2019, 9:10 a.m. UTC
This series adds a new security hook that allows initializing the security
context of kernfs nodes properly, taking into account the parent context. Kernfs
nodes require special handling here, since they are not bound to specific
inodes/superblocks, but instead represent the backing tree structure that
is used to build the VFS tree when the kernfs tree is mounted.

The kernfs nodes initially do not store any security context and rely on
the LSM to assign some default context to inodes created over them. Kernfs
inodes, however, allow setting an explicit context via the *setxattr(2)
syscalls, in which case the context is stored inside the kernfs node's
metadata.
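
For illustration, setting and reading back such an explicit context from
userspace might look like the minimal C sketch below. It assumes a Linux
system and uses the standard setxattr(2)/getxattr(2) calls with the
"security.selinux" attribute name; the helper names label_set() and
label_get() are hypothetical and are not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Extended attribute under which SELinux stores a file's context. */
#define SELINUX_XATTR "security.selinux"

/*
 * Hypothetical helper: set an explicit SELinux context on a path.
 * Returns 0 on success or a negative errno value on failure.
 */
int label_set(const char *path, const char *ctx)
{
	assert(path != NULL && ctx != NULL);

	/* The size passed here includes the terminating NUL byte. */
	if (setxattr(path, SELINUX_XATTR, ctx, strlen(ctx) + 1, 0) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: read the SELinux context of a path into buf.
 * Returns 0 on success or a negative errno value on failure.
 */
int label_get(const char *path, char *buf, size_t buflen)
{
	ssize_t n;

	assert(path != NULL && buf != NULL && buflen > 0);

	/* Leave room for a NUL in case the stored value lacks one. */
	n = getxattr(path, SELINUX_XATTR, buf, buflen - 1);
	if (n < 0)
		return -errno;
	buf[n] = '\0';
	return 0;
}
```

A caller would pass a path to a kernfs-backed directory (for example, a
cgroup directory) and a context such as
"system_u:object_r:container_file_t:s0". Whether the call succeeds depends
on the caller's privileges and the loaded policy.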

SELinux (and possibly other LSMs) initialize the context of newly created
FS objects based on the parent object's context (usually the child inherits
the parent's context, unless the policy dictates otherwise). This is done
by hooking the creation of the new inode corresponding to the newly created
file/directory via security_inode_init_security() (most filesystems always
create a fresh inode when a new FS object is created). However, kernfs nodes
can be created "behind the scenes" while the filesystem is not mounted
anywhere and thus no inodes exist.

Therefore, to allow maintaining similar behavior for kernfs nodes, a new LSM
hook is needed, which would allow initializing the kernfs node's security
context based on the context stored in the parent's node (if any).

The main motivation for this change is that the userspace users of cgroupfs
(which is built on kernfs) expect the usual security context inheritance
to work under SELinux (see [1] and [2]). This functionality is required for
better confinement of containers under SELinux.

The first patch adds the new LSM hook; the second patch implements the hook
in SELinux; and the third patch modifies kernfs to use the new hook to
initialize the security context of kernfs nodes whenever their parent node
has a non-default context set.
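
The intended behavior can be modelled in userspace with the toy C sketch
below. It is not the code from these patches: struct toy_node and the
toy_*() functions are hypothetical stand-ins, and the "LSM" here simply
copies the parent's context, whereas a real policy may compute a different
one. It only illustrates the rule that a new node gets a stored context
when (and only when) its parent has an explicit one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a kernfs node; NOT the kernel's struct kernfs_node. */
struct toy_node {
	struct toy_node *parent;
	char *secctx;	/* NULL: no explicit context, LSM default applies */
};

/* Duplicate a context string; returns NULL on allocation failure. */
char *toy_ctx_dup(const char *ctx)
{
	size_t len;
	char *copy;

	assert(ctx != NULL);
	len = strlen(ctx) + 1;
	copy = malloc(len);
	if (copy)
		memcpy(copy, ctx, len);
	return copy;
}

/* Models an explicit setxattr() on the node: store a new context. */
int toy_node_set_ctx(struct toy_node *n, const char *ctx)
{
	char *copy;

	assert(n != NULL);
	copy = toy_ctx_dup(ctx);
	if (!copy)
		return -1;
	free(n->secctx);
	n->secctx = copy;
	return 0;
}

/*
 * Create a node. If the parent carries an explicit context, ask the
 * (toy) LSM for the child's context; here that is plain inheritance.
 */
struct toy_node *toy_node_create(struct toy_node *parent)
{
	struct toy_node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->parent = parent;
	if (parent && parent->secctx) {
		n->secctx = toy_ctx_dup(parent->secctx);
		if (!n->secctx) {
			free(n);
			return NULL;
		}
	}
	return n;
}

/* Context an inode built over this node would end up with. */
const char *toy_node_effective_ctx(const struct toy_node *n,
				   const char *fs_default)
{
	assert(n != NULL && fs_default != NULL);
	return n->secctx ? n->secctx : fs_default;
}
```

In this model, a child created under an unlabeled parent keeps falling back
to the filesystem default, while a child created after the parent was
explicitly relabeled starts out with its own copy of the parent's context.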

Note: the patches are based on current selinux/next [3], but they seem to
apply cleanly on top of v5.0-rc1 as well.

Testing:
- passed SELinux testsuite on Fedora 29 (x86_64) when applied on top of
  current Rawhide kernel (5.0.0-0.rc1.git0.1) [4]
- passed the reproducer from the last patch

[1] https://github.com/SELinuxProject/selinux-kernel/issues/39
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1553803
[3] https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/log/?h=selinux-pr-20181224
[4] https://copr.fedorainfracloud.org/coprs/omos/kernel-testing/build/842855/

Ondrej Mosnacek (3):
  LSM: Add new hook for generic node initialization
  selinux: Implement the object_init_security hook
  kernfs: Initialize security of newly created nodes

 fs/kernfs/dir.c             | 49 ++++++++++++++++++++++++++++++++++---
 fs/kernfs/inode.c           |  9 +++----
 fs/kernfs/kernfs-internal.h |  4 +++
 include/linux/lsm_hooks.h   |  5 ++++
 include/linux/security.h    | 12 +++++++++
 security/security.c         |  8 ++++++
 security/selinux/hooks.c    | 41 +++++++++++++++++++++++++++++++
 7 files changed, 120 insertions(+), 8 deletions(-)

Comments

Tejun Heo Jan. 11, 2019, 8:50 p.m. UTC | #1
Hello,

On Wed, Jan 09, 2019 at 10:10:25AM +0100, Ondrej Mosnacek wrote:
> The main motivation for this change is that the userspace users of cgroupfs
> (which is built on kernfs) expect the usual security context inheritance
> to work under SELinux (see [1] and [2]). This functionality is required for
> better confinement of containers under SELinux.

Can you please go into details on what the expected use cases are like
for cgroupfs?  It shows up as a filesystem but isn't a real one and
has its own permission scheme for delegation and stuff.  If sysfs
hasn't needed selinux support, I'm having a bit of difficulty seeing
why cgroupfs would.

Thanks.
Ondrej Mosnacek Jan. 14, 2019, 9:14 a.m. UTC | #2
On Fri, Jan 11, 2019 at 9:51 PM Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, Jan 09, 2019 at 10:10:25AM +0100, Ondrej Mosnacek wrote:
> > The main motivation for this change is that the userspace users of cgroupfs
> > (which is built on kernfs) expect the usual security context inheritance
> > to work under SELinux (see [1] and [2]). This functionality is required for
> > better confinement of containers under SELinux.
>
> Can you please go into details on what the expected use cases are like
> for cgroupfs?  It shows up as a filesystem but isn't a real one and
> has its own permission scheme for delegation and stuff.  If sysfs
> hasn't needed selinux support, I'm having a bit of difficulty seeing
> why cgroupfs would.

I'm not sure what the exact needs of the container people are, but
IIUC the goal is to make it possible to have a subtree labeled with a
specific label (that gets inherited by newly created cgroups in that
subtree by default) so that container processes do not need to be
given permissions for the whole cgroupfs tree.

I'm cc'ing Dan Walsh, who should be able to explain the use cases in
more details. Dan, this is related to the cgroupfs labeling problem
([1] and [2]). See [3] for the root of this discussion.

[1] https://github.com/SELinuxProject/selinux-kernel/issues/39
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1553803
[3] https://lore.kernel.org/selinux/CAFqZXNsxfjwDaCWDrqxP736y_3Jm-r=twaHtkkTDtMuym774Jw@mail.gmail.com/T/


--
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.
Ondrej Mosnacek Jan. 14, 2019, 9:29 a.m. UTC | #3
On Mon, Jan 14, 2019 at 10:14 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> [...]
> [3] https://lore.kernel.org/selinux/CAFqZXNsxfjwDaCWDrqxP736y_3Jm-r=twaHtkkTDtMuym774Jw@mail.gmail.com/T/

Actually, this thread belongs to v1 of the patch series, which is archived here:
https://lore.kernel.org/selinux/CAFqZXNu-bHGmUi80UiyW3djcbedycC+0KUyiQuv9-8b+WmrYuA@mail.gmail.com/T/
Tejun Heo Jan. 14, 2019, 3:50 p.m. UTC | #4
Hello,

On Mon, Jan 14, 2019 at 10:14:32AM +0100, Ondrej Mosnacek wrote:
> I'm not sure what the exact needs of the container people are, but
> IIUC the goal is to make it possible to have a subtree labeled with a
> specific label (that gets inherited by newly created cgroups in that
> subtree by default) so that container processes do not need to be
> given permissions for the whole cgroupfs tree.
> 
> I'm cc'ing Dan Walsh, who should be able to explain the use cases in
> more details. Dan, this is related to the cgroupfs labeling problem
> ([1] and [2]). See [3] for the root of this discussion.

Let's wait for Dan to respond but I'm pretty skeptical that this is a
good direction.

Thanks.
Stephen Smalley Jan. 15, 2019, 2:36 p.m. UTC | #5
On 1/11/19 3:50 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 09, 2019 at 10:10:25AM +0100, Ondrej Mosnacek wrote:
>> The main motivation for this change is that the userspace users of cgroupfs
>> (which is built on kernfs) expect the usual security context inheritance
>> to work under SELinux (see [1] and [2]). This functionality is required for
>> better confinement of containers under SELinux.
> 
> Can you please go into details on what the expected use cases are like
> for cgroupfs?  It shows up as a filesystem but isn't a real one and
> has its own permission scheme for delegation and stuff.  If sysfs
> hasn't needed selinux support, I'm having a bit of difficulty seeing
> why cgroupfs would.

Just to clarify with respect to your last point about sysfs, sysfs 
selinux support was first introduced in commit ddd29ec6597125c830f7 
("sysfs: Add labeling support for sysfs") for use by libvirt, and this 
support was carried over into kernfs, and is extensively used 
particularly in Android for controlling access to sysfs files.  The 
patch set in this series is extending that support to enable inheritance 
of security labels set via setxattr from parent to child when 
appropriate, which has particularly been requested for cgroup but would 
also be useful for sysfs.
Tejun Heo Jan. 17, 2019, 4:15 p.m. UTC | #6
Hello,

On Thu, Jan 17, 2019 at 10:01:23AM -0500, Daniel Walsh wrote:
> The above comment is correct.  We want to be able to run a container
> where we hand it control over a limited subdir of the cgroups hierarchy.
> We can currently do this and label the content correctly, but when
> subdirs of the directory get created by processes inside the container
> they do not get the correct label.  For example we add a label like
> system_u:object_r:container_file_t:s0 to a directory but when the
> process inside of the container creates a fd within this directory the
> kernel says the label is the default label for cgroups
> system_u:object_r:cgroup_t:s0.  This forces us to write looser policy
> that from an SELinux point of view allows a process within the container
> to write anywhere on the cgroup file system, rather than just the
> designated directories.

Can you please go into a bit more details on why the existing
cgroup delegation model isn't enough?

Thanks.
Stephen Smalley Jan. 17, 2019, 4:39 p.m. UTC | #7
On 1/17/19 11:15 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jan 17, 2019 at 10:01:23AM -0500, Daniel Walsh wrote:
>> The above comment is correct.  We want to be able to run a container
>> where we hand it control over a limited subdir of the cgroups hierarchy.
>> We can currently do this and label the content correctly, but when
>> subdirs of the directory get created by processes inside the container
>> they do not get the correct label.  For example we add a label like
>> system_u:object_r:container_file_t:s0 to a directory but when the
>> process inside of the container creates a fd within this directory the
>> kernel says the label is the default label for cgroups
>> system_u:object_r:cgroup_t:s0.  This forces us to write looser policy
>> that from an SELinux point of view allows a process within the container
>> to write anywhere on the cgroup file system, rather than just the
>> designated directories.
> 
> Can you please go into a bit more details on why the existing
> cgroup delegation model isn't enough?

I would hazard a guess that it is because the existing cgroup delegation 
model is based on user IDs and discretionary access control (DAC), 
whereas they are using per-container SELinux security contexts and 
mandatory access control (MAC) to enforce the separation of containers 
irrespective of UID and DAC.  Optimally both would be supported by 
cgroup, as DAC and MAC have different properties and use cases.
Daniel Walsh Jan. 17, 2019, 8:30 p.m. UTC | #8
On 1/17/19 11:39 AM, Stephen Smalley wrote:
> On 1/17/19 11:15 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Jan 17, 2019 at 10:01:23AM -0500, Daniel Walsh wrote:
>>> The above comment is correct.  We want to be able to run a container
>>> where we hand it control over a limited subdir of the cgroups hierarchy.
>>> We can currently do this and label the content correctly, but when
>>> subdirs of the directory get created by processes inside the container
>>> they do not get the correct label.  For example we add a label like
>>> system_u:object_r:container_file_t:s0 to a directory but when the
>>> process inside of the container creates a fd within this directory the
>>> kernel says the label is the default label for cgroups
>>> system_u:object_r:cgroup_t:s0.  This forces us to write looser policy
>>> that from an SELinux point of view allows a process within the
>>> container
>>> to write anywhere on the cgroup file system, rather than just the
>>> designated directories.
>>
>> Can you please go into a bit more details on why the existing
>> cgroup delegation model isn't enough?
>
> I would hazard a guess that it is because the existing cgroup
> delegation model is based on user IDs and discretionary access control
> (DAC), whereas they are using per-container SELinux security contexts
> and mandatory access control (MAC) to enforce the separation of
> containers irrespective of UID and DAC.  Optimally both would be
> supported by cgroup, as DAC and MAC have different properties and use
> cases.

As Stephen said, the existing model is DAC.  We have the situation where we
have a "root" process running within a container that is not using User
Namespace.  I want to control that that root process can not write to
anywhere within the cgroup hierarchy based on SELinux controls.   This
is security in depth.  If other mechanisms prevent the process from
writing to other places in cgroups that is great, but I want it also
secured from a MAC Point of view.
Daniel Walsh Jan. 17, 2019, 8:35 p.m. UTC | #9
On 1/17/19 11:15 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 17, 2019 at 10:01:23AM -0500, Daniel Walsh wrote:
>> The above comment is correct.  We want to be able to run a container
>> where we hand it control over a limited subdir of the cgroups hierarchy.
>> We can currently do this and label the content correctly, but when
>> subdirs of the directory get created by processes inside the container
>> they do not get the correct label.  For example we add a label like
>> system_u:object_r:container_file_t:s0 to a directory but when the
>> process inside of the container creates a fd within this directory the
>> kernel says the label is the default label for cgroups
>> system_u:object_r:cgroup_t:s0.  This forces us to write looser policy
>> that from an SELinux point of view allows a process within the container
>> to write anywhere on the cgroup file system, rather than just the
>> designated directories.
> Can you please go into a bit more details on why the existing
> cgroup delegation model isn't enough?
>
> Thanks.
>
If I label a container container_t:s0:c1,c2 by policy it can only write
to container_file_t:s0:c1,c2.  So the container engine sets up files and
directories within the cgroup hierarchy with labels of
container_file_t:s0:c1,c2.  When the container writes to one of these
directories, the kernel says the file is labeled cgroup_t:s0, and is
denied by policy.  In most/all other file systems that support labeling,
the content of a directory gets the same label as the containing
directory.  So from an SELinux point of view, I would have expected the
kernel to label the new file as container_file_t:s0:c1,c2 and everything
would work securely.  But cgroups does not work correctly so we need to
add a rule that says container_t:s0:c1,c2 can write files labeled
cgroup_t:s0 which means it can write anywhere on /sys/fs/cgroup.

This is from a MAC Point of view.   I don't care  if other security
measure might control this, I want to have security in depth and have
MAC Control it.
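
The policy trade-off described in this message can be illustrated with the
toy C model below. It is only a sketch under simplifying assumptions: it
is not how SELinux evaluates policy, real MCS constraints are more nuanced
than the exact category match used here, and struct toy_ctx, CAT() and
toy_may_write() are hypothetical names. The "loose rule" flag stands for
the extra allow rule that lets container_t write files labeled cgroup_t:s0.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy security context: a type name plus a bitmask of MCS categories. */
struct toy_ctx {
	const char *type;
	unsigned int cats;	/* bit n set means category c<n> */
};

#define CAT(n) (1u << (n))

/*
 * Toy write check for a container_t process.
 *  - container_file_t: allowed only if the categories match exactly.
 *  - cgroup_t: allowed only if the loose rule has been added to policy.
 *  - anything else: denied.
 */
bool toy_may_write(const struct toy_ctx *proc, const struct toy_ctx *file,
		   bool loose_cgroup_rule)
{
	assert(proc != NULL && file != NULL);

	if (strcmp(proc->type, "container_t") != 0)
		return false;	/* only the container domain is modelled */

	if (strcmp(file->type, "container_file_t") == 0)
		return proc->cats == file->cats;

	if (strcmp(file->type, "cgroup_t") == 0)
		return loose_cgroup_rule;

	return false;
}
```

With a process labeled container_t:s0:c1,c2, a new cgroup directory that
inherits container_file_t:s0:c1,c2 is writable under the strict policy,
whereas one that falls back to cgroup_t:s0 is writable only once the loose
rule is added, and that rule then covers every cgroup_t:s0 file.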