diff mbox

[RFC,v2,09/10] landlock: Handle cgroups

Message ID 1472121165-29071-10-git-send-email-mic@digikod.net (mailing list archive)
State New, archived
Headers show

Commit Message

Mickaël Salaün Aug. 25, 2016, 10:32 a.m. UTC
Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
to compare the current process cgroup with a cgroup handle, The handle
can match the current cgroup if it is the same or a child. This allows
to make conditional rules according to the current cgroup.

A cgroup handle is a map entry created from a file descriptor referring
a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.

An unprivileged process can create and manipulate cgroups thanks to
cgroup delegation.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/bpf.h                |  8 ++++
 include/uapi/linux/bpf.h           | 15 ++++++
 kernel/bpf/arraymap.c              | 30 ++++++++++++
 kernel/bpf/verifier.c              |  6 +++
 security/landlock/Kconfig          |  3 ++
 security/landlock/Makefile         |  2 +-
 security/landlock/checker_cgroup.c | 96 ++++++++++++++++++++++++++++++++++++++
 security/landlock/checker_cgroup.h | 18 +++++++
 security/landlock/lsm.c            |  8 ++++
 9 files changed, 185 insertions(+), 1 deletion(-)
 create mode 100644 security/landlock/checker_cgroup.c
 create mode 100644 security/landlock/checker_cgroup.h

Comments

Andy Lutomirski Aug. 25, 2016, 11:09 a.m. UTC | #1
On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün <mic@digikod.net> wrote:
> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
> to compare the current process cgroup with a cgroup handle, The handle
> can match the current cgroup if it is the same or a child. This allows
> to make conditional rules according to the current cgroup.
>
> A cgroup handle is a map entry created from a file descriptor referring
> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.

Can you elaborate on why this is useful?  I.e. why not just supply
different policies to different subtrees.

Also, how does this interact with the current cgroup v1 vs v2 mess?
As far as I can tell, no one can even really agree on what "what
cgroup am I in" means right now.

>
> An unprivileged process can create and manipulate cgroups thanks to
> cgroup delegation.

What is cgroup delegation?

--Andy
Mickaël Salaün Aug. 25, 2016, 2:44 p.m. UTC | #2
On 25/08/2016 13:09, Andy Lutomirski wrote:
> On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün <mic@digikod.net> wrote:
>> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
>> to compare the current process cgroup with a cgroup handle, The handle
>> can match the current cgroup if it is the same or a child. This allows
>> to make conditional rules according to the current cgroup.
>>
>> A cgroup handle is a map entry created from a file descriptor referring
>> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
>> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
>> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
> 
> Can you elaborate on why this is useful?  I.e. why not just supply
> different policies to different subtrees.

The main use case I see is to load the security policies at the start of
a user session for all processes but not enforce them right away. The
user can then keep a shell for Landlock administration tasks and lock
the other processes with a dedicated cgroup on the fly. This allows the
user to make unremovable Landlock security policies but only activate
them when needed for specific processes.

> 
> Also, how does this interact with the current cgroup v1 vs v2 mess?
> As far as I can tell, no one can even really agree on what "what
> cgroup am I in" means right now.

I tested with cgroup-v2 but indeed, it seems a bit different with
cgroup-v1 :)
Does anyone know how to handle both cases?

> 
>>
>> An unprivileged process can create and manipulate cgroups thanks to
>> cgroup delegation.
> 
> What is cgroup delegation?

This is simply the action of changing the owner of cgroup sysfs files to
allow an unprivileged user to handle them (cf. Documentation/cgroup-v2.txt)

 Mickaël
Alexei Starovoitov Aug. 26, 2016, 2:14 a.m. UTC | #3
On Thu, Aug 25, 2016 at 12:32:44PM +0200, Mickaël Salaün wrote:
> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
> to compare the current process cgroup with a cgroup handle, The handle
> can match the current cgroup if it is the same or a child. This allows
> to make conditional rules according to the current cgroup.
> 
> A cgroup handle is a map entry created from a file descriptor referring
> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
> 
> An unprivileged process can create and manipulate cgroups thanks to
> cgroup delegation.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
...
> +static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
> +		u64 r3_map_op, u64 r4, u64 r5)
> +{
> +	u8 option = (u8) r1_option;
> +	struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> +	enum bpf_map_array_op map_op = r3_map_op;
> +	struct bpf_array *array = container_of(map, struct bpf_array, map);
> +	struct cgroup *cg1, *cg2;
> +	struct map_landlock_handle *handle;
> +	int i;
> +
> +	/* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
> +	if (unlikely(!map)) {
> +		WARN_ON(1);
> +		return -EFAULT;
> +	}
> +	if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
> +		return -EINVAL;
> +
> +	/* for now, only handle OP_OR */
> +	switch (map_op) {
> +	case BPF_MAP_ARRAY_OP_OR:
> +		break;
> +	case BPF_MAP_ARRAY_OP_UNSPEC:
> +	case BPF_MAP_ARRAY_OP_AND:
> +	case BPF_MAP_ARRAY_OP_XOR:
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	synchronize_rcu();
> +
> +	for (i = 0; i < array->n_entries; i++) {
> +		handle = (struct map_landlock_handle *)
> +				(array->value + array->elem_size * i);
> +
> +		/* protected by the proto types, should not happen */
> +		if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
> +			WARN_ON(1);
> +			return -EFAULT;
> +		}
> +		if (unlikely(!handle->css)) {
> +			WARN_ON(1);
> +			return -EFAULT;
> +		}
> +
> +		if (option & LANDLOCK_FLAG_OPT_REVERSE) {
> +			cg1 = handle->css->cgroup;
> +			cg2 = task_css_set(current)->dfl_cgrp;
> +		} else {
> +			cg1 = task_css_set(current)->dfl_cgrp;
> +			cg2 = handle->css->cgroup;
> +		}
> +
> +		if (cgroup_is_descendant(cg1, cg2))
> +			return 0;
> +	}
> +	return 1;
> +}

- please take a loook at exisiting bpf_current_task_under_cgroup and
reuse BPF_MAP_TYPE_CGROUP_ARRAY as a minimum. Doing new cgroup array
is nothing but duplication of the code.

- I don't think such 'for' loop can scale. The solution needs to work
with thousands of containers and thousands of cgroups.
In the patch 06/10 the proposal is to use 'current' as holder of
the programs:
+   for (prog = current->seccomp.landlock_prog;
+                   prog; prog = prog->prev) {
+           if (prog->filter == landlock_ret->filter) {
+                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
+                   break;
+           }
+   }
imo that's the root of scalability issue.
I think to be able to scale the bpf programs have to be attached to
cgroups instead of tasks.
That would be very different api. seccomp doesn't need to be touched.
But that is the only way I see to be able to scale.
May be another way of thinking about it is 'lsm cgroup controller'
that Sargun is proposing.
The lsm hooks will provide stable execution points and the programs
will be called like:
prog = task_css_set(current)->dfl_cgrp->bpf.prog_effective[lsm_hook_id];
BPF_PROG_RUN(prog, ctx);
The delegation functionality and 'prog_effective' logic that
Daniel Mack is proposing will be fully reused here.
External container management software will be able to apply bpf
programs to control tasks under cgroup and such
bpf_landlock_cmp_cgroup_beneath() helper won't be necessary.
The user will be able to register different programs for different lsm hooks.
If I understand the patch 6/10 correctly, there is one (or a list) prog for
all lsm hooks per task which is not flexible enough.
Anoop Naravaram's use case is to control the ports the applications
under cgroup can bind and listen on.
Such use case can be solved by such 'lsm cgroup controller' by
attaching bpf program to security_socket_bind lsm hook and
filtering sockaddr.
Furthermore Sargun's use case is to allow further sockaddr rewrites
from the bpf program which can be done as natural extension
of such mechanism.

If I understood Daniel's Anoop's Sargun's and yours use cases
correctly the common piece of kernel infrastructure that can solve
them all can start from Daniel's current set of patches that
establish a mechanism of attaching bpf program to a cgroup.
Then adding lsm hooks to it and later allowing argument rewrite
(since they're already in the kernel and no ToCToU problems exist)

As far as safety and type checking that bpf programs has to do,
I like the approach of patch 06/10:
+LANDLOCK_HOOK2(file_open, FILE_OPEN,
+       PTR_TO_STRUCT_FILE, struct file *, file,
+       PTR_TO_STRUCT_CRED, const struct cred *, cred
+)
teaching verifier to recognize struct file, cred, sockaddr
will let bpf program access them naturally without any overhead.
Though:
@@ -102,6 +102,9 @@ enum bpf_prog_type {
        BPF_PROG_TYPE_SCHED_CLS,
        BPF_PROG_TYPE_SCHED_ACT,
        BPF_PROG_TYPE_TRACEPOINT,
+       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
+       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
+       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
 };
is a bit of overkill.
I think it would be cleaner to have single
BPF_PROG_TYPE_LSM and at program load time pass
lsm_hook_id as well, so that verifier can do safety checks
based on type info provided in LANDLOCK_HOOKs
Tejun Heo Aug. 26, 2016, 12:55 p.m. UTC | #4
Hello,

On Thu, Aug 25, 2016 at 04:44:13PM +0200, Mickaël Salaün wrote:
> I tested with cgroup-v2 but indeed, it seems a bit different with
> cgroup-v1 :)
> Does anyone know how to handle both cases?

If you wanna do cgroup membership test, just do cgroup v2 membership
test.  No need to introduce a new controller and possibly struct sock
association field for that.  That's what all new cgroup aware network
operations are using anyway and doesn't conflicts with whether other
controllers are v1 or v2.

For examples of using cgroup v2 membership test, please take a look at
cgroup_mt_v1().

Thanks.
Andy Lutomirski Aug. 26, 2016, 2:20 p.m. UTC | #5
On Thu, Aug 25, 2016 at 7:44 AM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 25/08/2016 13:09, Andy Lutomirski wrote:
>> On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
>>> to compare the current process cgroup with a cgroup handle, The handle
>>> can match the current cgroup if it is the same or a child. This allows
>>> to make conditional rules according to the current cgroup.
>>>
>>> A cgroup handle is a map entry created from a file descriptor referring
>>> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
>>> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
>>> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
>>
>> Can you elaborate on why this is useful?  I.e. why not just supply
>> different policies to different subtrees.
>
> The main use case I see is to load the security policies at the start of
> a user session for all processes but not enforce them right away. The
> user can then keep a shell for Landlock administration tasks and lock
> the other processes with a dedicated cgroup on the fly. This allows the
> user to make unremovable Landlock security policies but only activate
> them when needed for specific processes.

This seems like a bit of a dubious use case to me.  The landlock
mechanism should be flexible enough to do this kind of thing even
without cgroups, and "spawn a process, wait a while, and then confine
it by fiddling with cgroups" seems a lot dicier than just loading the
right policy in the first place, especially since eBPF policies can be
stateful.

>
>>
>> Also, how does this interact with the current cgroup v1 vs v2 mess?
>> As far as I can tell, no one can even really agree on what "what
>> cgroup am I in" means right now.
>
> I tested with cgroup-v2 but indeed, it seems a bit different with
> cgroup-v1 :)
> Does anyone know how to handle both cases?
>
>>
>>>
>>> An unprivileged process can create and manipulate cgroups thanks to
>>> cgroup delegation.
>>
>> What is cgroup delegation?
>
> This is simply the action of changing the owner of cgroup sysfs files to
> allow an unprivileged user to handle them (cf. Documentation/cgroup-v2.txt)

As far as I can tell, Tejun and systemd both actively discourage doing
this.  Maybe I misunderstand.  But in any event, the admin giving you
a cgroup hierarchy you can use for this means that the admin has to
cooperate with your policy, and it further requires (with cgroup v2 or
similar, which is most likely the future) that your lockdown policy be
compatible with your resource control policy.

I would suggest dropping this lockdown feature until a use case
emerges that really can't be addressed adequately without it.

--Andy
Mickaël Salaün Aug. 26, 2016, 3:10 p.m. UTC | #6
On 26/08/2016 04:14, Alexei Starovoitov wrote:
> On Thu, Aug 25, 2016 at 12:32:44PM +0200, Mickaël Salaün wrote:
>> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
>> to compare the current process cgroup with a cgroup handle, The handle
>> can match the current cgroup if it is the same or a child. This allows
>> to make conditional rules according to the current cgroup.
>>
>> A cgroup handle is a map entry created from a file descriptor referring
>> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
>> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
>> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
>>
>> An unprivileged process can create and manipulate cgroups thanks to
>> cgroup delegation.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> ...
>> +static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
>> +		u64 r3_map_op, u64 r4, u64 r5)
>> +{
>> +	u8 option = (u8) r1_option;
>> +	struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
>> +	enum bpf_map_array_op map_op = r3_map_op;
>> +	struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +	struct cgroup *cg1, *cg2;
>> +	struct map_landlock_handle *handle;
>> +	int i;
>> +
>> +	/* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
>> +	if (unlikely(!map)) {
>> +		WARN_ON(1);
>> +		return -EFAULT;
>> +	}
>> +	if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
>> +		return -EINVAL;
>> +
>> +	/* for now, only handle OP_OR */
>> +	switch (map_op) {
>> +	case BPF_MAP_ARRAY_OP_OR:
>> +		break;
>> +	case BPF_MAP_ARRAY_OP_UNSPEC:
>> +	case BPF_MAP_ARRAY_OP_AND:
>> +	case BPF_MAP_ARRAY_OP_XOR:
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	synchronize_rcu();
>> +
>> +	for (i = 0; i < array->n_entries; i++) {
>> +		handle = (struct map_landlock_handle *)
>> +				(array->value + array->elem_size * i);
>> +
>> +		/* protected by the proto types, should not happen */
>> +		if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
>> +			WARN_ON(1);
>> +			return -EFAULT;
>> +		}
>> +		if (unlikely(!handle->css)) {
>> +			WARN_ON(1);
>> +			return -EFAULT;
>> +		}
>> +
>> +		if (option & LANDLOCK_FLAG_OPT_REVERSE) {
>> +			cg1 = handle->css->cgroup;
>> +			cg2 = task_css_set(current)->dfl_cgrp;
>> +		} else {
>> +			cg1 = task_css_set(current)->dfl_cgrp;
>> +			cg2 = handle->css->cgroup;
>> +		}
>> +
>> +		if (cgroup_is_descendant(cg1, cg2))
>> +			return 0;
>> +	}
>> +	return 1;
>> +}
> 
> - please take a loook at exisiting bpf_current_task_under_cgroup and
> reuse BPF_MAP_TYPE_CGROUP_ARRAY as a minimum. Doing new cgroup array
> is nothing but duplication of the code.

Oh, I didn't know about this patchset and the new helper. Indeed, it
looks a lot like mine except there is no static verification of the map
type as I did with the arraymap of handles, and no batch mode either. I
think the return value of bpf_current_task_under_cgroup is error-prone
if an eBPF program do an "if(ret)" test on the value (because of the
negative ERRNO return value). Inverting the 0 and 1 return values should
fix this (0 == succeed, 1 == failed, <0 == error).


To sum up, there is four related patchsets:
* "Landlock LSM: Unprivileged sandboxing" (this series)
* "Add Checmate, BPF-driven minor LSM" (Sargun Dhillon)
* "Networking cgroup controller" (Anoop Naravaram)
* "Add eBPF hooks for cgroups" (Daniel Mack)

The three other series (Sargun's, Anoop's and Daniel's) are mainly
focused on network access-control via cgroup for *containers*. As far as
I can tell, only a *root* user (CAP_SYS_ADMIN) can use them. Landlock's
goal is to empower all processes (privileged or not) to create their own
sandbox. This also means, like explained in "[RFC v2 00/10] Landlock
LSM: Unprivileged sandboxing", there is more constraints. For example,
it is not acceptable to let a process probe the kernel memory as it
wish. More details are in the Landlock cover-letter.


Another important point is that supporting cgroup for Landlock is
optional. It does not rely on cgroup to be usable but is only a feature
available when (unprivileged) users can manage there own cgroup, which
is an important constraint. Put another way, Landlock should not rely on
cgroup to create sandboxes. Indeed, a process creating a sandbox do not
necessarily have access to the cgroup mount point (directly or not).


> 
> - I don't think such 'for' loop can scale. The solution needs to work
> with thousands of containers and thousands of cgroups.
> In the patch 06/10 the proposal is to use 'current' as holder of
> the programs:
> +   for (prog = current->seccomp.landlock_prog;
> +                   prog; prog = prog->prev) {
> +           if (prog->filter == landlock_ret->filter) {
> +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> +                   break;
> +           }
> +   }
> imo that's the root of scalability issue.
> I think to be able to scale the bpf programs have to be attached to
> cgroups instead of tasks.
> That would be very different api. seccomp doesn't need to be touched.
> But that is the only way I see to be able to scale.

Landlock is inspired from seccomp which also use a BPF program per
thread. For seccomp, each BPF programs are executed for each syscall.
For Landlock, some BPF programs are executed for some LSM hooks. I don't
see why it is a scale issue for Landlock comparing to seccomp. I also
don't see why storing the BPF program list pointer in the cgroup struct
instead of the task struct change a lot here. The BPF programs execution
will be the same anyway (for each LSM hook). Kees should probably have a
better opinion on this.


> May be another way of thinking about it is 'lsm cgroup controller'
> that Sargun is proposing.
> The lsm hooks will provide stable execution points and the programs
> will be called like:
> prog = task_css_set(current)->dfl_cgrp->bpf.prog_effective[lsm_hook_id];
> BPF_PROG_RUN(prog, ctx);
> The delegation functionality and 'prog_effective' logic that
> Daniel Mack is proposing will be fully reused here.
> External container management software will be able to apply bpf
> programs to control tasks under cgroup and such
> bpf_landlock_cmp_cgroup_beneath() helper won't be necessary.
> The user will be able to register different programs for different lsm hooks.
> If I understand the patch 6/10 correctly, there is one (or a list) prog for
> all lsm hooks per task which is not flexible enough.

For each LSM hook triggered by a thread, all of its Landlock eBPF
programs (dedicated for this kind of hook) will be evaluated (if
needed). This is the same behavior as seccomp (list of BPF programs
attached to a process hierarchy) except the BPF programs are not
evaluated for syscall but for LSM hooks. There is no way to make it more
fine-grained :)


> Anoop Naravaram's use case is to control the ports the applications
> under cgroup can bind and listen on.
> Such use case can be solved by such 'lsm cgroup controller' by
> attaching bpf program to security_socket_bind lsm hook and
> filtering sockaddr.
> Furthermore Sargun's use case is to allow further sockaddr rewrites
> from the bpf program which can be done as natural extension
> of such mechanism.
> 
> If I understood Daniel's Anoop's Sargun's and yours use cases
> correctly the common piece of kernel infrastructure that can solve
> them all can start from Daniel's current set of patches that
> establish a mechanism of attaching bpf program to a cgroup.
> Then adding lsm hooks to it and later allowing argument rewrite
> (since they're already in the kernel and no ToCToU problems exist)

To sum up, the pieces we have in common are the eBPF use and (optionally
for Landlock) there execution depending of the current cgroup.

Moreover, the three other series (Sargun's, Anoop's and Daniel's) do not
deal with unprivileged process which is the main purpose of Landlock.
I'm not sure that allowing sockaddr rewrites is a good idea here... Like
other LSMs, Landlock is dedicated to access-control.


For the network-related series, I think it make more sense to simply
create a netfilter rule matching a cgroup and then add more features to
netfilter (restrict port ranges and so on) thanks to eBPF programs.
Containers are (usually) in a dedicated network namespace, which open
the possibility to not only rely on cgroups (e.g. match UID,
netmask...). It would also be more flexible to be able to load a BPF
program in netfilter and update its maps on the fly to make dynamic
rules, like ipset does, but in a more generic way.


> 
> As far as safety and type checking that bpf programs has to do,
> I like the approach of patch 06/10:
> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
> +       PTR_TO_STRUCT_FILE, struct file *, file,
> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
> +)
> teaching verifier to recognize struct file, cred, sockaddr
> will let bpf program access them naturally without any overhead.
> Though:
> @@ -102,6 +102,9 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
>  };
> is a bit of overkill.
> I think it would be cleaner to have single
> BPF_PROG_TYPE_LSM and at program load time pass
> lsm_hook_id as well, so that verifier can do safety checks
> based on type info provided in LANDLOCK_HOOKs

I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
verifier check programs according to their types. If we need to check
specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
dedicated program types. I don't see any other way to do it with the
current verifier code. Moreover it's the purpose of program types, right?

Regards,
 Mickaël
Tejun Heo Aug. 26, 2016, 3:50 p.m. UTC | #7
Hello,

On Fri, Aug 26, 2016 at 07:20:35AM -0700, Andy Lutomirski wrote:
> > This is simply the action of changing the owner of cgroup sysfs files to
> > allow an unprivileged user to handle them (cf. Documentation/cgroup-v2.txt)
> 
> As far as I can tell, Tejun and systemd both actively discourage doing
> this.  Maybe I misunderstand.  But in any event, the admin giving you

Please refer to "2-5. Delegation" of Documentation/cgroup-v2.txt.
Delegation on v1 is broken on both core and specific controller
behaviors and thus discouraged.  On v2, delegation should work just
fine.

I haven't looked in detail but in general I'm not too excited about
layering security mechanism on top of cgroup.  Maybe it makes some
sense when security domain coincides with resource domains but at any
rate please keep me in the loop.

Thanks.
Alexei Starovoitov Aug. 26, 2016, 11:05 p.m. UTC | #8
On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> 

trimming cc list again. When it's too big vger will consider it as spam.

> On 26/08/2016 04:14, Alexei Starovoitov wrote:
> > On Thu, Aug 25, 2016 at 12:32:44PM +0200, Mickaël Salaün wrote:
> >> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
> >> to compare the current process cgroup with a cgroup handle, The handle
> >> can match the current cgroup if it is the same or a child. This allows
> >> to make conditional rules according to the current cgroup.
> >>
> >> A cgroup handle is a map entry created from a file descriptor referring
> >> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
> >> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
> >> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
> >>
> >> An unprivileged process can create and manipulate cgroups thanks to
> >> cgroup delegation.
> >>
> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > ...
> >> +static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
> >> +		u64 r3_map_op, u64 r4, u64 r5)
> >> +{
> >> +	u8 option = (u8) r1_option;
> >> +	struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> >> +	enum bpf_map_array_op map_op = r3_map_op;
> >> +	struct bpf_array *array = container_of(map, struct bpf_array, map);
> >> +	struct cgroup *cg1, *cg2;
> >> +	struct map_landlock_handle *handle;
> >> +	int i;
> >> +
> >> +	/* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
> >> +	if (unlikely(!map)) {
> >> +		WARN_ON(1);
> >> +		return -EFAULT;
> >> +	}
> >> +	if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
> >> +		return -EINVAL;
> >> +
> >> +	/* for now, only handle OP_OR */
> >> +	switch (map_op) {
> >> +	case BPF_MAP_ARRAY_OP_OR:
> >> +		break;
> >> +	case BPF_MAP_ARRAY_OP_UNSPEC:
> >> +	case BPF_MAP_ARRAY_OP_AND:
> >> +	case BPF_MAP_ARRAY_OP_XOR:
> >> +	default:
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	synchronize_rcu();
> >> +
> >> +	for (i = 0; i < array->n_entries; i++) {
> >> +		handle = (struct map_landlock_handle *)
> >> +				(array->value + array->elem_size * i);
> >> +
> >> +		/* protected by the proto types, should not happen */
> >> +		if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
> >> +			WARN_ON(1);
> >> +			return -EFAULT;
> >> +		}
> >> +		if (unlikely(!handle->css)) {
> >> +			WARN_ON(1);
> >> +			return -EFAULT;
> >> +		}
> >> +
> >> +		if (option & LANDLOCK_FLAG_OPT_REVERSE) {
> >> +			cg1 = handle->css->cgroup;
> >> +			cg2 = task_css_set(current)->dfl_cgrp;
> >> +		} else {
> >> +			cg1 = task_css_set(current)->dfl_cgrp;
> >> +			cg2 = handle->css->cgroup;
> >> +		}
> >> +
> >> +		if (cgroup_is_descendant(cg1, cg2))
> >> +			return 0;
> >> +	}
> >> +	return 1;
> >> +}
> > 
> > - please take a loook at exisiting bpf_current_task_under_cgroup and
> > reuse BPF_MAP_TYPE_CGROUP_ARRAY as a minimum. Doing new cgroup array
> > is nothing but duplication of the code.
> 
> Oh, I didn't know about this patchset and the new helper. Indeed, it
> looks a lot like mine except there is no static verification of the map
> type as I did with the arraymap of handles, and no batch mode either. I
> think the return value of bpf_current_task_under_cgroup is error-prone
> if an eBPF program do an "if(ret)" test on the value (because of the
> negative ERRNO return value). Inverting the 0 and 1 return values should
> fix this (0 == succeed, 1 == failed, <0 == error).

nothing to fix. It's good as-is. Use if (ret > 0) instead.

> 
> To sum up, there is four related patchsets:
> * "Landlock LSM: Unprivileged sandboxing" (this series)
> * "Add Checmate, BPF-driven minor LSM" (Sargun Dhillon)
> * "Networking cgroup controller" (Anoop Naravaram)
> * "Add eBPF hooks for cgroups" (Daniel Mack)
> 
> The three other series (Sargun's, Anoop's and Daniel's) are mainly
> focused on network access-control via cgroup for *containers*. As far as
> I can tell, only a *root* user (CAP_SYS_ADMIN) can use them. Landlock's
> goal is to empower all processes (privileged or not) to create their own
> sandbox. This also means, like explained in "[RFC v2 00/10] Landlock
> LSM: Unprivileged sandboxing", there is more constraints. For example,
> it is not acceptable to let a process probe the kernel memory as it
> wish. More details are in the Landlock cover-letter.
> 
> 
> Another important point is that supporting cgroup for Landlock is
> optional. It does not rely on cgroup to be usable but is only a feature
> available when (unprivileged) users can manage there own cgroup, which
> is an important constraint. Put another way, Landlock should not rely on
> cgroup to create sandboxes. Indeed, a process creating a sandbox do not
> necessarily have access to the cgroup mount point (directly or not).

cgroup is the common way to group multiple tasks.
Without cgroup only parent<->child relationship will be possible,
which will limit usability of such lsm to a master task that controls
its children. Such api restriction would have been ok, if we could
extend it in the future, but unfortunately task-centric won't allow it
without creating a parallel lsm that is cgroup based.
Therefore I think we have to go with cgroup-centric api and your
application has to use cgroups from the start though only parent-child
would have been enough.
Also I don't think the kernel can afford two bpf based lsm. One task
based and another cgroup based, so we have to find common ground
that suits both use cases.
Having unprivliged access is a subset. There is no strong reason why
cgroup+lsm+bpf should be limited to root only always.
When we can guarantee no pointer leaks, we can allow unpriv.

> > 
> > - I don't think such 'for' loop can scale. The solution needs to work
> > with thousands of containers and thousands of cgroups.
> > In the patch 06/10 the proposal is to use 'current' as holder of
> > the programs:
> > +   for (prog = current->seccomp.landlock_prog;
> > +                   prog; prog = prog->prev) {
> > +           if (prog->filter == landlock_ret->filter) {
> > +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> > +                   break;
> > +           }
> > +   }
> > imo that's the root of scalability issue.
> > I think to be able to scale the bpf programs have to be attached to
> > cgroups instead of tasks.
> > That would be very different api. seccomp doesn't need to be touched.
> > But that is the only way I see to be able to scale.
> 
> Landlock is inspired from seccomp which also use a BPF program per
> thread. For seccomp, each BPF programs are executed for each syscall.
> For Landlock, some BPF programs are executed for some LSM hooks. I don't
> see why it is a scale issue for Landlock comparing to seccomp. I also
> don't see why storing the BPF program list pointer in the cgroup struct
> instead of the task struct change a lot here. The BPF programs execution
> will be the same anyway (for each LSM hook). Kees should probably have a
> better opinion on this.

seccomp has its own issues and copying them doesn't make this lsm any better.
Like seccomp bpf programs are all gigantic switch statement that looks
for interesting syscall numbers. All syscalls of a task are paying
non-trivial seccomp penalty due to such design. If bpf was attached per
syscall it would have been much cheaper. Of course doing it this way
for seccomp is not easy, but for lsm such facility is already there.
Blank call of a single bpf prog for all lsm hooks is unnecessary
overhead that can and should be avoided.

> > May be another way of thinking about it is 'lsm cgroup controller'
> > that Sargun is proposing.
> > The lsm hooks will provide stable execution points and the programs
> > will be called like:
> > prog = task_css_set(current)->dfl_cgrp->bpf.prog_effective[lsm_hook_id];
> > BPF_PROG_RUN(prog, ctx);
> > The delegation functionality and 'prog_effective' logic that
> > Daniel Mack is proposing will be fully reused here.
> > External container management software will be able to apply bpf
> > programs to control tasks under cgroup and such
> > bpf_landlock_cmp_cgroup_beneath() helper won't be necessary.
> > The user will be able to register different programs for different lsm hooks.
> > If I understand the patch 6/10 correctly, there is one (or a list) prog for
> > all lsm hooks per task which is not flexible enough.
> 
> For each LSM hook triggered by a thread, all of its Landlock eBPF
> programs (dedicated for this kind of hook) will be evaluated (if
> needed). This is the same behavior as seccomp (list of BPF programs
> attached to a process hierarchy) except the BPF programs are not
> evaluated for syscall but for LSM hooks. There is no way to make it more
> fine-grained :)

There is a way to attach different bpf program per cgroup
and per lsm hook. Such approach drastically reduces overhead
of sandboxed application.

> > Anoop Naravaram's use case is to control the ports the applications
> > under cgroup can bind and listen on.
> > Such use case can be solved by such 'lsm cgroup controller' by
> > attaching bpf program to security_socket_bind lsm hook and
> > filtering sockaddr.
> > Furthermore Sargun's use case is to allow further sockaddr rewrites
> > from the bpf program which can be done as natural extension
> > of such mechanism.
> > 
> > If I understood Daniel's Anoop's Sargun's and yours use cases
> > correctly the common piece of kernel infrastructure that can solve
> > them all can start from Daniel's current set of patches that
> > establish a mechanism of attaching bpf program to a cgroup.
> > Then adding lsm hooks to it and later allowing argument rewrite
> > (since they're already in the kernel and no ToCToU problems exist)
> 
> To sum up, the pieces we have in common are the eBPF use and (optionally
> for Landlock) there execution depending of the current cgroup.
> 
> Moreover, the three other series (Sargun's, Anoop's and Daniel's) do not
> deal with unprivileged process which is the main purpose of Landlock.
> I'm not sure that allowing sockaddr rewrites is a good idea here... Like
> other LSMs, Landlock is dedicated to access-control.

we have to find common ground and common infra to solve all these use cases.
Pointing out the differences isn't going to make this snowflake
any more special.

> For the network-related series, I think it make more sense to simply
> create a netfilter rule matching a cgroup and then add more features to
> netfilter (restrict port ranges and so on) thanks to eBPF programs.
> Containers are (usually) in a dedicated network namespace, which open
> the possibility to not only rely on cgroups (e.g. match UID,
> netmask...). It would also be more flexible to be able to load a BPF
> program in netfilter and update its maps on the fly to make dynamic
> rules, like ipset does, but in a more generic way.
> 
> 
> > 
> > As far as safety and type checking that bpf programs has to do,
> > I like the approach of patch 06/10:
> > +LANDLOCK_HOOK2(file_open, FILE_OPEN,
> > +       PTR_TO_STRUCT_FILE, struct file *, file,
> > +       PTR_TO_STRUCT_CRED, const struct cred *, cred
> > +)
> > teaching verifier to recognize struct file, cred, sockaddr
> > will let bpf program access them naturally without any overhead.
> > Though:
> > @@ -102,6 +102,9 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_SCHED_CLS,
> >         BPF_PROG_TYPE_SCHED_ACT,
> >         BPF_PROG_TYPE_TRACEPOINT,
> > +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
> > +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
> > +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
> >  };
> > is a bit of overkill.
> > I think it would be cleaner to have single
> > BPF_PROG_TYPE_LSM and at program load time pass
> > lsm_hook_id as well, so that verifier can do safety checks
> > based on type info provided in LANDLOCK_HOOKs
> 
> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
> verifier check programs according to their types. If we need to check
> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
> dedicated program types. I don't see any other way to do it with the
> current verifier code. Moreover it's the purpose of program types, right?

Adding new bpf program type for every lsm hook is not acceptable.
Either do one new program type + pass lsm_hook_id as suggested
or please come up with an alternative approach.
Andy Lutomirski Aug. 27, 2016, 7:30 a.m. UTC | #9
On Aug 27, 2016 1:05 AM, "Alexei Starovoitov"
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >
>
> trimming cc list again. When it's too big vger will consider it as spam.
>
> > On 26/08/2016 04:14, Alexei Starovoitov wrote:
> > > On Thu, Aug 25, 2016 at 12:32:44PM +0200, Mickaël Salaün wrote:
> > >> Add an eBPF function bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
> > >> to compare the current process cgroup with a cgroup handle, The handle
> > >> can match the current cgroup if it is the same or a child. This allows
> > >> to make conditional rules according to the current cgroup.
> > >>
> > >> A cgroup handle is a map entry created from a file descriptor referring
> > >> a cgroup directory (e.g. by opening /sys/fs/cgroup/X). In this case, the
> > >> map entry is of type BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD and the
> > >> inferred array map is of type BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP.
> > >>
> > >> An unprivileged process can create and manipulate cgroups thanks to
> > >> cgroup delegation.
> > >>
> > >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > > ...
> > >> +static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
> > >> +          u64 r3_map_op, u64 r4, u64 r5)
> > >> +{
> > >> +  u8 option = (u8) r1_option;
> > >> +  struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> > >> +  enum bpf_map_array_op map_op = r3_map_op;
> > >> +  struct bpf_array *array = container_of(map, struct bpf_array, map);
> > >> +  struct cgroup *cg1, *cg2;
> > >> +  struct map_landlock_handle *handle;
> > >> +  int i;
> > >> +
> > >> +  /* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
> > >> +  if (unlikely(!map)) {
> > >> +          WARN_ON(1);
> > >> +          return -EFAULT;
> > >> +  }
> > >> +  if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
> > >> +          return -EINVAL;
> > >> +
> > >> +  /* for now, only handle OP_OR */
> > >> +  switch (map_op) {
> > >> +  case BPF_MAP_ARRAY_OP_OR:
> > >> +          break;
> > >> +  case BPF_MAP_ARRAY_OP_UNSPEC:
> > >> +  case BPF_MAP_ARRAY_OP_AND:
> > >> +  case BPF_MAP_ARRAY_OP_XOR:
> > >> +  default:
> > >> +          return -EINVAL;
> > >> +  }
> > >> +
> > >> +  synchronize_rcu();
> > >> +
> > >> +  for (i = 0; i < array->n_entries; i++) {
> > >> +          handle = (struct map_landlock_handle *)
> > >> +                          (array->value + array->elem_size * i);
> > >> +
> > >> +          /* protected by the proto types, should not happen */
> > >> +          if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
> > >> +                  WARN_ON(1);
> > >> +                  return -EFAULT;
> > >> +          }
> > >> +          if (unlikely(!handle->css)) {
> > >> +                  WARN_ON(1);
> > >> +                  return -EFAULT;
> > >> +          }
> > >> +
> > >> +          if (option & LANDLOCK_FLAG_OPT_REVERSE) {
> > >> +                  cg1 = handle->css->cgroup;
> > >> +                  cg2 = task_css_set(current)->dfl_cgrp;
> > >> +          } else {
> > >> +                  cg1 = task_css_set(current)->dfl_cgrp;
> > >> +                  cg2 = handle->css->cgroup;
> > >> +          }
> > >> +
> > >> +          if (cgroup_is_descendant(cg1, cg2))
> > >> +                  return 0;
> > >> +  }
> > >> +  return 1;
> > >> +}
> > >
> > > - please take a loook at exisiting bpf_current_task_under_cgroup and
> > > reuse BPF_MAP_TYPE_CGROUP_ARRAY as a minimum. Doing new cgroup array
> > > is nothing but duplication of the code.
> >
> > Oh, I didn't know about this patchset and the new helper. Indeed, it
> > looks a lot like mine except there is no static verification of the map
> > type as I did with the arraymap of handles, and no batch mode either. I
> > think the return value of bpf_current_task_under_cgroup is error-prone
> > if an eBPF program do an "if(ret)" test on the value (because of the
> > negative ERRNO return value). Inverting the 0 and 1 return values should
> > fix this (0 == succeed, 1 == failed, <0 == error).
>
> nothing to fix. It's good as-is. Use if (ret > 0) instead.
>
> >
> > To sum up, there is four related patchsets:
> > * "Landlock LSM: Unprivileged sandboxing" (this series)
> > * "Add Checmate, BPF-driven minor LSM" (Sargun Dhillon)
> > * "Networking cgroup controller" (Anoop Naravaram)
> > * "Add eBPF hooks for cgroups" (Daniel Mack)
> >
> > The three other series (Sargun's, Anoop's and Daniel's) are mainly
> > focused on network access-control via cgroup for *containers*. As far as
> > I can tell, only a *root* user (CAP_SYS_ADMIN) can use them. Landlock's
> > goal is to empower all processes (privileged or not) to create their own
> > sandbox. This also means, like explained in "[RFC v2 00/10] Landlock
> > LSM: Unprivileged sandboxing", there is more constraints. For example,
> > it is not acceptable to let a process probe the kernel memory as it
> > wish. More details are in the Landlock cover-letter.
> >
> >
> > Another important point is that supporting cgroup for Landlock is
> > optional. It does not rely on cgroup to be usable but is only a feature
> > available when (unprivileged) users can manage there own cgroup, which
> > is an important constraint. Put another way, Landlock should not rely on
> > cgroup to create sandboxes. Indeed, a process creating a sandbox do not
> > necessarily have access to the cgroup mount point (directly or not).
>
> cgroup is the common way to group multiple tasks.
> Without cgroup only parent<->child relationship will be possible,
> which will limit usability of such lsm to a master task that controls
> its children. Such api restriction would have been ok, if we could
> extend it in the future, but unfortunately task-centric won't allow it
> without creating a parallel lsm that is cgroup based.
> Therefore I think we have to go with cgroup-centric api and your
> application has to use cgroups from the start though only parent-child
> would have been enough.
> Also I don't think the kernel can afford two bpf based lsm. One task
> based and another cgroup based, so we have to find common ground
> that suits both use cases.
> Having unprivliged access is a subset. There is no strong reason why
> cgroup+lsm+bpf should be limited to root only always.
> When we can guarantee no pointer leaks, we can allow unpriv.

I don't really understand what you mean.  In the context of landlock,
which is a *sandbox*, can one of you explain a use case that
materially benefits from this type of cgroup usage?  I haven't thought
of one.
Mickaël Salaün Aug. 27, 2016, 2:06 p.m. UTC | #10
On 27/08/2016 01:05, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>>
>>>
>>> - I don't think such 'for' loop can scale. The solution needs to work
>>> with thousands of containers and thousands of cgroups.
>>> In the patch 06/10 the proposal is to use 'current' as holder of
>>> the programs:
>>> +   for (prog = current->seccomp.landlock_prog;
>>> +                   prog; prog = prog->prev) {
>>> +           if (prog->filter == landlock_ret->filter) {
>>> +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
>>> +                   break;
>>> +           }
>>> +   }
>>> imo that's the root of scalability issue.
>>> I think to be able to scale the bpf programs have to be attached to
>>> cgroups instead of tasks.
>>> That would be very different api. seccomp doesn't need to be touched.
>>> But that is the only way I see to be able to scale.
>>
>> Landlock is inspired from seccomp which also use a BPF program per
>> thread. For seccomp, each BPF programs are executed for each syscall.
>> For Landlock, some BPF programs are executed for some LSM hooks. I don't
>> see why it is a scale issue for Landlock comparing to seccomp. I also
>> don't see why storing the BPF program list pointer in the cgroup struct
>> instead of the task struct change a lot here. The BPF programs execution
>> will be the same anyway (for each LSM hook). Kees should probably have a
>> better opinion on this.
> 
> seccomp has its own issues and copying them doesn't make this lsm any better.
> Like seccomp bpf programs are all gigantic switch statement that looks
> for interesting syscall numbers. All syscalls of a task are paying
> non-trivial seccomp penalty due to such design. If bpf was attached per
> syscall it would have been much cheaper. Of course doing it this way
> for seccomp is not easy, but for lsm such facility is already there.
> Blank call of a single bpf prog for all lsm hooks is unnecessary
> overhead that can and should be avoided.

It's probably a misunderstanding. Contrary to seccomp which run all the
thread's BPF programs for any syscall, Landlock only run eBPF programs
for the triggered LSM hooks, if their type match. Indeed, thanks to the
multiple eBPF program types and contrary to seccomp, Landlock only run
an eBPF program when needed. Landlock will have almost no performance
overhead if the syscalls do not trigger the watched LSM hooks for the
current process.


> 
>>> May be another way of thinking about it is 'lsm cgroup controller'
>>> that Sargun is proposing.
>>> The lsm hooks will provide stable execution points and the programs
>>> will be called like:
>>> prog = task_css_set(current)->dfl_cgrp->bpf.prog_effective[lsm_hook_id];
>>> BPF_PROG_RUN(prog, ctx);
>>> The delegation functionality and 'prog_effective' logic that
>>> Daniel Mack is proposing will be fully reused here.
>>> External container management software will be able to apply bpf
>>> programs to control tasks under cgroup and such
>>> bpf_landlock_cmp_cgroup_beneath() helper won't be necessary.
>>> The user will be able to register different programs for different lsm hooks.
>>> If I understand the patch 6/10 correctly, there is one (or a list) prog for
>>> all lsm hooks per task which is not flexible enough.
>>
>> For each LSM hook triggered by a thread, all of its Landlock eBPF
>> programs (dedicated for this kind of hook) will be evaluated (if
>> needed). This is the same behavior as seccomp (list of BPF programs
>> attached to a process hierarchy) except the BPF programs are not
>> evaluated for syscall but for LSM hooks. There is no way to make it more
>> fine-grained :)
> 
> There is a way to attach different bpf program per cgroup
> and per lsm hook. Such approach drastically reduces overhead
> of sandboxed application.

As said above, Landlock will not run an eBPF programs when not strictly
needed. Attaching to a cgroup will have the same performance impact as
attaching to a process hierarchy.
Mickaël Salaün Aug. 27, 2016, 2:19 p.m. UTC | #11
On 27/08/2016 01:05, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>> To sum up, there is four related patchsets:
>> * "Landlock LSM: Unprivileged sandboxing" (this series)
>> * "Add Checmate, BPF-driven minor LSM" (Sargun Dhillon)
>> * "Networking cgroup controller" (Anoop Naravaram)
>> * "Add eBPF hooks for cgroups" (Daniel Mack)

>>> Anoop Naravaram's use case is to control the ports the applications
>>> under cgroup can bind and listen on.
>>> Such use case can be solved by such 'lsm cgroup controller' by
>>> attaching bpf program to security_socket_bind lsm hook and
>>> filtering sockaddr.
>>> Furthermore Sargun's use case is to allow further sockaddr rewrites
>>> from the bpf program which can be done as natural extension
>>> of such mechanism.
>>>
>>> If I understood Daniel's Anoop's Sargun's and yours use cases
>>> correctly the common piece of kernel infrastructure that can solve
>>> them all can start from Daniel's current set of patches that
>>> establish a mechanism of attaching bpf program to a cgroup.
>>> Then adding lsm hooks to it and later allowing argument rewrite
>>> (since they're already in the kernel and no ToCToU problems exist)

>> For the network-related series, I think it make more sense to simply
>> create a netfilter rule matching a cgroup and then add more features to
>> netfilter (restrict port ranges and so on) thanks to eBPF programs.
>> Containers are (usually) in a dedicated network namespace, which open
>> the possibility to not only rely on cgroups (e.g. match UID,
>> netmask...). It would also be more flexible to be able to load a BPF
>> program in netfilter and update its maps on the fly to make dynamic
>> rules, like ipset does, but in a more generic way.

What do the netdev folks think about this design?
Mickaël Salaün Aug. 27, 2016, 2:34 p.m. UTC | #12
On 27/08/2016 01:05, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>
>>> As far as safety and type checking that bpf programs has to do,
>>> I like the approach of patch 06/10:
>>> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
>>> +       PTR_TO_STRUCT_FILE, struct file *, file,
>>> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
>>> +)
>>> teaching verifier to recognize struct file, cred, sockaddr
>>> will let bpf program access them naturally without any overhead.
>>> Though:
>>> @@ -102,6 +102,9 @@ enum bpf_prog_type {
>>>         BPF_PROG_TYPE_SCHED_CLS,
>>>         BPF_PROG_TYPE_SCHED_ACT,
>>>         BPF_PROG_TYPE_TRACEPOINT,
>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
>>> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
>>>  };
>>> is a bit of overkill.
>>> I think it would be cleaner to have single
>>> BPF_PROG_TYPE_LSM and at program load time pass
>>> lsm_hook_id as well, so that verifier can do safety checks
>>> based on type info provided in LANDLOCK_HOOKs
>>
>> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
>> verifier check programs according to their types. If we need to check
>> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
>> dedicated program types. I don't see any other way to do it with the
>> current verifier code. Moreover it's the purpose of program types, right?
> 
> Adding new bpf program type for every lsm hook is not acceptable.
> Either do one new program type + pass lsm_hook_id as suggested
> or please come up with an alternative approach.

OK, so we have to modify the verifier to not only rely on the program
type but on another value to check the context accesses. Do you have a
hint from where this value could come from? Do we need to add a new bpf
command to associate a program to a subtype?
Alexei Starovoitov Aug. 27, 2016, 6:06 p.m. UTC | #13
On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> 
> On 27/08/2016 01:05, Alexei Starovoitov wrote:
> > On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >>
> >>>
> >>> - I don't think such 'for' loop can scale. The solution needs to work
> >>> with thousands of containers and thousands of cgroups.
> >>> In the patch 06/10 the proposal is to use 'current' as holder of
> >>> the programs:
> >>> +   for (prog = current->seccomp.landlock_prog;
> >>> +                   prog; prog = prog->prev) {
> >>> +           if (prog->filter == landlock_ret->filter) {
> >>> +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> >>> +                   break;
> >>> +           }
> >>> +   }
> >>> imo that's the root of scalability issue.
> >>> I think to be able to scale the bpf programs have to be attached to
> >>> cgroups instead of tasks.
> >>> That would be very different api. seccomp doesn't need to be touched.
> >>> But that is the only way I see to be able to scale.
> >>
> >> Landlock is inspired from seccomp which also use a BPF program per
> >> thread. For seccomp, each BPF programs are executed for each syscall.
> >> For Landlock, some BPF programs are executed for some LSM hooks. I don't
> >> see why it is a scale issue for Landlock comparing to seccomp. I also
> >> don't see why storing the BPF program list pointer in the cgroup struct
> >> instead of the task struct change a lot here. The BPF programs execution
> >> will be the same anyway (for each LSM hook). Kees should probably have a
> >> better opinion on this.
> > 
> > seccomp has its own issues and copying them doesn't make this lsm any better.
> > Like seccomp bpf programs are all gigantic switch statement that looks
> > for interesting syscall numbers. All syscalls of a task are paying
> > non-trivial seccomp penalty due to such design. If bpf was attached per
> > syscall it would have been much cheaper. Of course doing it this way
> > for seccomp is not easy, but for lsm such facility is already there.
> > Blank call of a single bpf prog for all lsm hooks is unnecessary
> > overhead that can and should be avoided.
> 
> It's probably a misunderstanding. Contrary to seccomp which run all the
> thread's BPF programs for any syscall, Landlock only run eBPF programs
> for the triggered LSM hooks, if their type match. Indeed, thanks to the
> multiple eBPF program types and contrary to seccomp, Landlock only run
> an eBPF program when needed. Landlock will have almost no performance
> overhead if the syscalls do not trigger the watched LSM hooks for the
> current process.

that's not what I see in the patch 06/10:
all lsm_hooks in 'static struct security_hook_list landlock_hooks'
(which eventually means all lsm hooks) will call
static inline int landlock_hook_##NAME
which will call landlock_run_prog()
which does:
+ for (landlock_ret = current->seccomp.landlock_ret;
+      landlock_ret; landlock_ret = landlock_ret->prev) {
+    if (landlock_ret->triggered) {
+       ctx.cookie = landlock_ret->cookie;
+       for (prog = current->seccomp.landlock_prog;
+            prog; prog = prog->prev) {
+               if (prog->filter == landlock_ret->filter) {
+                       cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
+                       break;
+               }
+       }

that is unacceptable overhead and not a scalable design.
It kinda works for 3 lsm_hooks as in patch 6, but doesn't scale
as soon as more lsm hooks are added.

> As said above, Landlock will not run an eBPF programs when not strictly
> needed. Attaching to a cgroup will have the same performance impact as
> attaching to a process hierarchy.

Having a prog per cgroup per lsm_hook is the only scalable way I
could come up with. If you see another way, please propose.
current->seccomp.landlock_prog is not the answer.
Alexei Starovoitov Aug. 27, 2016, 6:11 p.m. UTC | #14
On Sat, Aug 27, 2016 at 12:30:36AM -0700, Andy Lutomirski wrote:
> > cgroup is the common way to group multiple tasks.
> > Without cgroup only parent<->child relationship will be possible,
> > which will limit usability of such lsm to a master task that controls
> > its children. Such api restriction would have been ok, if we could
> > extend it in the future, but unfortunately task-centric won't allow it
> > without creating a parallel lsm that is cgroup based.
> > Therefore I think we have to go with cgroup-centric api and your
> > application has to use cgroups from the start though only parent-child
> > would have been enough.
> > Also I don't think the kernel can afford two bpf based lsm. One task
> > based and another cgroup based, so we have to find common ground
> > that suits both use cases.
> > Having unprivliged access is a subset. There is no strong reason why
> > cgroup+lsm+bpf should be limited to root only always.
> > When we can guarantee no pointer leaks, we can allow unpriv.
> 
> I don't really understand what you mean.  In the context of landlock,
> which is a *sandbox*, can one of you explain a use case that
> materially benefits from this type of cgroup usage?  I haven't thought
> of one.

In case of seccomp-like sandbox where parent controls child processes
cgroup is not needed. It's needed when container management software
needs to control a set of applications. If we can have one bpf-based lsm
that works via cgroup and without, I'd be fine with it. Right now
I haven't seen a plausible proposal to do that. Therefore cgroup based
api is a common api that works for sandbox as well, though requiring
parent to create a cgroup just to control a single child is cumbersome.
Alexei Starovoitov Aug. 27, 2016, 6:19 p.m. UTC | #15
On Sat, Aug 27, 2016 at 04:34:55PM +0200, Mickaël Salaün wrote:
> 
> On 27/08/2016 01:05, Alexei Starovoitov wrote:
> > On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >
> >>> As far as safety and type checking that bpf programs has to do,
> >>> I like the approach of patch 06/10:
> >>> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
> >>> +       PTR_TO_STRUCT_FILE, struct file *, file,
> >>> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
> >>> +)
> >>> teaching verifier to recognize struct file, cred, sockaddr
> >>> will let bpf program access them naturally without any overhead.
> >>> Though:
> >>> @@ -102,6 +102,9 @@ enum bpf_prog_type {
> >>>         BPF_PROG_TYPE_SCHED_CLS,
> >>>         BPF_PROG_TYPE_SCHED_ACT,
> >>>         BPF_PROG_TYPE_TRACEPOINT,
> >>> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
> >>> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
> >>> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
> >>>  };
> >>> is a bit of overkill.
> >>> I think it would be cleaner to have single
> >>> BPF_PROG_TYPE_LSM and at program load time pass
> >>> lsm_hook_id as well, so that verifier can do safety checks
> >>> based on type info provided in LANDLOCK_HOOKs
> >>
> >> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
> >> verifier check programs according to their types. If we need to check
> >> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
> >> dedicated program types. I don't see any other way to do it with the
> >> current verifier code. Moreover it's the purpose of program types, right?
> > 
> > Adding new bpf program type for every lsm hook is not acceptable.
> > Either do one new program type + pass lsm_hook_id as suggested
> > or please come up with an alternative approach.
> 
> OK, so we have to modify the verifier to not only rely on the program
> type but on another value to check the context accesses. Do you have a
> hint from where this value could come from? Do we need to add a new bpf
> command to associate a program to a subtype?

It's another field prog_subtype (or prog_hook_id) in union bpf_attr.
Both prog_type and prog_hook_id are used during verification.
prog_type distinguishes the main aspects whereas prog_hook_id selects
which lsm_hook's argument definition to apply.
At the time of attaching to a hook, the prog_hook_id passed at the
load time should match lsm's hook_id.
Alexei Starovoitov Aug. 27, 2016, 6:32 p.m. UTC | #16
On Sat, Aug 27, 2016 at 04:19:05PM +0200, Mickaël Salaün wrote:
> 
> On 27/08/2016 01:05, Alexei Starovoitov wrote:
> > On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >> To sum up, there is four related patchsets:
> >> * "Landlock LSM: Unprivileged sandboxing" (this series)
> >> * "Add Checmate, BPF-driven minor LSM" (Sargun Dhillon)
> >> * "Networking cgroup controller" (Anoop Naravaram)
> >> * "Add eBPF hooks for cgroups" (Daniel Mack)
> 
> >>> Anoop Naravaram's use case is to control the ports the applications
> >>> under cgroup can bind and listen on.
> >>> Such use case can be solved by such 'lsm cgroup controller' by
> >>> attaching bpf program to security_socket_bind lsm hook and
> >>> filtering sockaddr.
> >>> Furthermore Sargun's use case is to allow further sockaddr rewrites
> >>> from the bpf program which can be done as natural extension
> >>> of such mechanism.
> >>>
> >>> If I understood Daniel's Anoop's Sargun's and yours use cases
> >>> correctly the common piece of kernel infrastructure that can solve
> >>> them all can start from Daniel's current set of patches that
> >>> establish a mechanism of attaching bpf program to a cgroup.
> >>> Then adding lsm hooks to it and later allowing argument rewrite
> >>> (since they're already in the kernel and no ToCToU problems exist)
> 
> >> For the network-related series, I think it make more sense to simply
> >> create a netfilter rule matching a cgroup and then add more features to
> >> netfilter (restrict port ranges and so on) thanks to eBPF programs.
> >> Containers are (usually) in a dedicated network namespace, which open
> >> the possibility to not only rely on cgroups (e.g. match UID,
> >> netmask...). It would also be more flexible to be able to load a BPF
> >> program in netfilter and update its maps on the fly to make dynamic
> >> rules, like ipset does, but in a more generic way.
> 
> What do the netdev folks think about this design?

such design doesn't scale when used for container management and
that's what we need to solve.
netns has its overhead and management issues. There are proposals to
solve that but that is orthogonal to containers in general.
A lot of them don't use netns.
Mickaël Salaün Aug. 27, 2016, 7:35 p.m. UTC | #17
On 27/08/2016 20:06, Alexei Starovoitov wrote:
> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>
>> On 27/08/2016 01:05, Alexei Starovoitov wrote:
>>> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>>>>
>>>>>
>>>>> - I don't think such 'for' loop can scale. The solution needs to work
>>>>> with thousands of containers and thousands of cgroups.
>>>>> In the patch 06/10 the proposal is to use 'current' as holder of
>>>>> the programs:
>>>>> +   for (prog = current->seccomp.landlock_prog;
>>>>> +                   prog; prog = prog->prev) {
>>>>> +           if (prog->filter == landlock_ret->filter) {
>>>>> +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
>>>>> +                   break;
>>>>> +           }
>>>>> +   }
>>>>> imo that's the root of scalability issue.
>>>>> I think to be able to scale the bpf programs have to be attached to
>>>>> cgroups instead of tasks.
>>>>> That would be very different api. seccomp doesn't need to be touched.
>>>>> But that is the only way I see to be able to scale.
>>>>
>>>> Landlock is inspired from seccomp which also use a BPF program per
>>>> thread. For seccomp, each BPF programs are executed for each syscall.
>>>> For Landlock, some BPF programs are executed for some LSM hooks. I don't
>>>> see why it is a scale issue for Landlock comparing to seccomp. I also
>>>> don't see why storing the BPF program list pointer in the cgroup struct
>>>> instead of the task struct change a lot here. The BPF programs execution
>>>> will be the same anyway (for each LSM hook). Kees should probably have a
>>>> better opinion on this.
>>>
>>> seccomp has its own issues and copying them doesn't make this lsm any better.
>>> Like seccomp bpf programs are all gigantic switch statement that looks
>>> for interesting syscall numbers. All syscalls of a task are paying
>>> non-trivial seccomp penalty due to such design. If bpf was attached per
>>> syscall it would have been much cheaper. Of course doing it this way
>>> for seccomp is not easy, but for lsm such facility is already there.
>>> Blank call of a single bpf prog for all lsm hooks is unnecessary
>>> overhead that can and should be avoided.
>>
>> It's probably a misunderstanding. Contrary to seccomp which run all the
>> thread's BPF programs for any syscall, Landlock only run eBPF programs
>> for the triggered LSM hooks, if their type match. Indeed, thanks to the
>> multiple eBPF program types and contrary to seccomp, Landlock only run
>> an eBPF program when needed. Landlock will have almost no performance
>> overhead if the syscalls do not trigger the watched LSM hooks for the
>> current process.
> 
> that's not what I see in the patch 06/10:
> all lsm_hooks in 'static struct security_hook_list landlock_hooks'
> (which eventually means all lsm hooks) will call
> static inline int landlock_hook_##NAME
> which will call landlock_run_prog()
> which does:
> + for (landlock_ret = current->seccomp.landlock_ret;
> +      landlock_ret; landlock_ret = landlock_ret->prev) {
> +    if (landlock_ret->triggered) {
> +       ctx.cookie = landlock_ret->cookie;
> +       for (prog = current->seccomp.landlock_prog;
> +            prog; prog = prog->prev) {
> +               if (prog->filter == landlock_ret->filter) {
> +                       cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> +                       break;
> +               }
> +       }
> 
> that is unacceptable overhead and not a scalable design.
> It kinda works for 3 lsm_hooks as in patch 6, but doesn't scale
> as soon as more lsm hooks are added.

Good catch! I forgot to check the program (sub)type in the loop to only
run the needed programs for the current hook. I will fix this.


> 
>> As said above, Landlock will not run an eBPF programs when not strictly
>> needed. Attaching to a cgroup will have the same performance impact as
>> attaching to a process hierarchy.
> 
> Having a prog per cgroup per lsm_hook is the only scalable way I
> could come up with. If you see another way, please propose.
> current->seccomp.landlock_prog is not the answer.

Hum, I don't see the difference from a performance point of view between
a cgroup-based or a process hierarchy-based system.

Maybe a better option should be to use an array of pointers with N
entries, one for each supported hook, instead of a unique pointer list?

Anyway, being able to attach an LSM hook program to a cgroup thanks to
the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
to use a process hierarchy). The downside will be to handle an LSM hook
program which is not triggered by a seccomp-filter, but this should be
needed anyway to handle interruptions.
Mickaël Salaün Aug. 27, 2016, 7:55 p.m. UTC | #18
On 27/08/2016 20:19, Alexei Starovoitov wrote:
> On Sat, Aug 27, 2016 at 04:34:55PM +0200, Mickaël Salaün wrote:
>>
>> On 27/08/2016 01:05, Alexei Starovoitov wrote:
>>> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>>>
>>>>> As far as safety and type checking that bpf programs has to do,
>>>>> I like the approach of patch 06/10:
>>>>> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
>>>>> +       PTR_TO_STRUCT_FILE, struct file *, file,
>>>>> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
>>>>> +)
>>>>> teaching verifier to recognize struct file, cred, sockaddr
>>>>> will let bpf program access them naturally without any overhead.
>>>>> Though:
>>>>> @@ -102,6 +102,9 @@ enum bpf_prog_type {
>>>>>         BPF_PROG_TYPE_SCHED_CLS,
>>>>>         BPF_PROG_TYPE_SCHED_ACT,
>>>>>         BPF_PROG_TYPE_TRACEPOINT,
>>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
>>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
>>>>> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
>>>>>  };
>>>>> is a bit of overkill.
>>>>> I think it would be cleaner to have single
>>>>> BPF_PROG_TYPE_LSM and at program load time pass
>>>>> lsm_hook_id as well, so that verifier can do safety checks
>>>>> based on type info provided in LANDLOCK_HOOKs
>>>>
>>>> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
>>>> verifier check programs according to their types. If we need to check
>>>> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
>>>> dedicated program types. I don't see any other way to do it with the
>>>> current verifier code. Moreover it's the purpose of program types, right?
>>>
>>> Adding new bpf program type for every lsm hook is not acceptable.
>>> Either do one new program type + pass lsm_hook_id as suggested
>>> or please come up with an alternative approach.
>>
>> OK, so we have to modify the verifier to not only rely on the program
>> type but on another value to check the context accesses. Do you have a
>> hint from where this value could come from? Do we need to add a new bpf
>> command to associate a program to a subtype?
> 
> It's another field prog_subtype (or prog_hook_id) in union bpf_attr.
> Both prog_type and prog_hook_id are used during verification.
> prog_type distinguishes the main aspects whereas prog_hook_id selects
> which lsm_hook's argument definition to apply.
> At the time of attaching to a hook, the prog_hook_id passed at the
> load time should match lsm's hook_id.

OK, so this new prog_subtype field should be use/set by a new bpf_cmd,
right? Something like BPF_PROG_SUBTYPE or BPF_PROG_METADATA?
Alexei Starovoitov Aug. 27, 2016, 8:43 p.m. UTC | #19
On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
> 
> On 27/08/2016 20:06, Alexei Starovoitov wrote:
> > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> >>
> >> On 27/08/2016 01:05, Alexei Starovoitov wrote:
> >>> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >>>>
> >>>>>
> >>>>> - I don't think such 'for' loop can scale. The solution needs to work
> >>>>> with thousands of containers and thousands of cgroups.
> >>>>> In the patch 06/10 the proposal is to use 'current' as holder of
> >>>>> the programs:
> >>>>> +   for (prog = current->seccomp.landlock_prog;
> >>>>> +                   prog; prog = prog->prev) {
> >>>>> +           if (prog->filter == landlock_ret->filter) {
> >>>>> +                   cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> >>>>> +                   break;
> >>>>> +           }
> >>>>> +   }
> >>>>> imo that's the root of scalability issue.
> >>>>> I think to be able to scale the bpf programs have to be attached to
> >>>>> cgroups instead of tasks.
> >>>>> That would be very different api. seccomp doesn't need to be touched.
> >>>>> But that is the only way I see to be able to scale.
> >>>>
> >>>> Landlock is inspired from seccomp which also use a BPF program per
> >>>> thread. For seccomp, each BPF programs are executed for each syscall.
> >>>> For Landlock, some BPF programs are executed for some LSM hooks. I don't
> >>>> see why it is a scale issue for Landlock comparing to seccomp. I also
> >>>> don't see why storing the BPF program list pointer in the cgroup struct
> >>>> instead of the task struct change a lot here. The BPF programs execution
> >>>> will be the same anyway (for each LSM hook). Kees should probably have a
> >>>> better opinion on this.
> >>>
> >>> seccomp has its own issues and copying them doesn't make this lsm any better.
> >>> Like seccomp bpf programs are all gigantic switch statement that looks
> >>> for interesting syscall numbers. All syscalls of a task are paying
> >>> non-trivial seccomp penalty due to such design. If bpf was attached per
> >>> syscall it would have been much cheaper. Of course doing it this way
> >>> for seccomp is not easy, but for lsm such facility is already there.
> >>> Blank call of a single bpf prog for all lsm hooks is unnecessary
> >>> overhead that can and should be avoided.
> >>
> >> It's probably a misunderstanding. Contrary to seccomp which run all the
> >> thread's BPF programs for any syscall, Landlock only run eBPF programs
> >> for the triggered LSM hooks, if their type match. Indeed, thanks to the
> >> multiple eBPF program types and contrary to seccomp, Landlock only run
> >> an eBPF program when needed. Landlock will have almost no performance
> >> overhead if the syscalls do not trigger the watched LSM hooks for the
> >> current process.
> > 
> > that's not what I see in the patch 06/10:
> > all lsm_hooks in 'static struct security_hook_list landlock_hooks'
> > (which eventually means all lsm hooks) will call
> > static inline int landlock_hook_##NAME
> > which will call landlock_run_prog()
> > which does:
> > + for (landlock_ret = current->seccomp.landlock_ret;
> > +      landlock_ret; landlock_ret = landlock_ret->prev) {
> > +    if (landlock_ret->triggered) {
> > +       ctx.cookie = landlock_ret->cookie;
> > +       for (prog = current->seccomp.landlock_prog;
> > +            prog; prog = prog->prev) {
> > +               if (prog->filter == landlock_ret->filter) {
> > +                       cur_ret = BPF_PROG_RUN(prog->prog, (void *)&ctx);
> > +                       break;
> > +               }
> > +       }
> > 
> > that is unacceptable overhead and not a scalable design.
> > It kinda works for 3 lsm_hooks as in patch 6, but doesn't scale
> > as soon as more lsm hooks are added.
> 
> Good catch! I forgot to check the program (sub)type in the loop to only
> run the needed programs for the current hook. I will fix this.
> 
> 
> > 
> >> As said above, Landlock will not run an eBPF programs when not strictly
> >> needed. Attaching to a cgroup will have the same performance impact as
> >> attaching to a process hierarchy.
> > 
> > Having a prog per cgroup per lsm_hook is the only scalable way I
> > could come up with. If you see another way, please propose.
> > current->seccomp.landlock_prog is not the answer.
> 
> Hum, I don't see the difference from a performance point of view between
> a cgroup-based or a process hierarchy-based system.
> 
> Maybe a better option should be to use an array of pointers with N
> entries, one for each supported hook, instead of a unique pointer list?

yes, clearly array dereference is faster than link list walk.
Now the question is where to keep this prog_array[num_lsm_hooks] ?
Since we cannot keep it inside task_struct, we have to allocate it.
Every time the task is creted then. What to do on the fork? That
will require changes all over. Then the obvious optimization would be
to share this allocated array of prog pointers across multiple tasks...
and little by little this new facility will look like cgroup.
Hence the suggestion to put this array into cgroup from the start.

> Anyway, being able to attach an LSM hook program to a cgroup thanks to
> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
> to use a process hierarchy). The downside will be to handle an LSM hook
> program which is not triggered by a seccomp-filter, but this should be
> needed anyway to handle interruptions.

what do you mean 'not triggered by seccomp' ?
You're not suggesting that this lsm has to enable seccomp to be functional?
imo that's non starter due to overhead.
Alexei Starovoitov Aug. 27, 2016, 8:56 p.m. UTC | #20
On Sat, Aug 27, 2016 at 09:55:01PM +0200, Mickaël Salaün wrote:
> 
> On 27/08/2016 20:19, Alexei Starovoitov wrote:
> > On Sat, Aug 27, 2016 at 04:34:55PM +0200, Mickaël Salaün wrote:
> >>
> >> On 27/08/2016 01:05, Alexei Starovoitov wrote:
> >>> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
> >>>
> >>>>> As far as safety and type checking that bpf programs has to do,
> >>>>> I like the approach of patch 06/10:
> >>>>> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
> >>>>> +       PTR_TO_STRUCT_FILE, struct file *, file,
> >>>>> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
> >>>>> +)
> >>>>> teaching verifier to recognize struct file, cred, sockaddr
> >>>>> will let bpf program access them naturally without any overhead.
> >>>>> Though:
> >>>>> @@ -102,6 +102,9 @@ enum bpf_prog_type {
> >>>>>         BPF_PROG_TYPE_SCHED_CLS,
> >>>>>         BPF_PROG_TYPE_SCHED_ACT,
> >>>>>         BPF_PROG_TYPE_TRACEPOINT,
> >>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
> >>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
> >>>>> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
> >>>>>  };
> >>>>> is a bit of overkill.
> >>>>> I think it would be cleaner to have single
> >>>>> BPF_PROG_TYPE_LSM and at program load time pass
> >>>>> lsm_hook_id as well, so that verifier can do safety checks
> >>>>> based on type info provided in LANDLOCK_HOOKs
> >>>>
> >>>> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
> >>>> verifier check programs according to their types. If we need to check
> >>>> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
> >>>> dedicated program types. I don't see any other way to do it with the
> >>>> current verifier code. Moreover it's the purpose of program types, right?
> >>>
> >>> Adding new bpf program type for every lsm hook is not acceptable.
> >>> Either do one new program type + pass lsm_hook_id as suggested
> >>> or please come up with an alternative approach.
> >>
> >> OK, so we have to modify the verifier to not only rely on the program
> >> type but on another value to check the context accesses. Do you have a
> >> hint from where this value could come from? Do we need to add a new bpf
> >> command to associate a program to a subtype?
> > 
> > It's another field prog_subtype (or prog_hook_id) in union bpf_attr.
> > Both prog_type and prog_hook_id are used during verification.
> > prog_type distinguishes the main aspects whereas prog_hook_id selects
> > which lsm_hook's argument definition to apply.
> > At the time of attaching to a hook, the prog_hook_id passed at the
> > load time should match lsm's hook_id.
> 
> OK, so this new prog_subtype field should be use/set by a new bpf_cmd,
> right? Something like BPF_PROG_SUBTYPE or BPF_PROG_METADATA?

No new command. It will be an optional field to existing BPF_PROG_LOAD.
In other words instead of replicating everything that different
bpf_prog_type-s need, we can pass this hook_id field to fine tune
the purpose (and applicability) of the program.
And one BPF_PROG_TYPE_LANDLOCK type will cover all lsm hooks.
For example existing BPF_PROG_TYPE_TRACEPOINT checks the safety
for all tracepoints while they all have different arguments.
For tracepoints it's easier, since the only difference between them
is the range of ctx access. Here we need strong type safety
of arguments therefore need extra hook_id at load time.
Mickaël Salaün Aug. 27, 2016, 9:14 p.m. UTC | #21
On 27/08/2016 22:43, Alexei Starovoitov wrote:
> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>> attaching to a process hierarchy.
>>>
>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>> could come up with. If you see another way, please propose.
>>> current->seccomp.landlock_prog is not the answer.
>>
>> Hum, I don't see the difference from a performance point of view between
>> a cgroup-based or a process hierarchy-based system.
>>
>> Maybe a better option should be to use an array of pointers with N
>> entries, one for each supported hook, instead of a unique pointer list?
> 
> yes, clearly array dereference is faster than link list walk.
> Now the question is where to keep this prog_array[num_lsm_hooks] ?
> Since we cannot keep it inside task_struct, we have to allocate it.
> Every time the task is creted then. What to do on the fork? That
> will require changes all over. Then the obvious optimization would be
> to share this allocated array of prog pointers across multiple tasks...
> and little by little this new facility will look like cgroup.
> Hence the suggestion to put this array into cgroup from the start.

I see your point :)

> 
>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>> to use a process hierarchy). The downside will be to handle an LSM hook
>> program which is not triggered by a seccomp-filter, but this should be
>> needed anyway to handle interruptions.
> 
> what do you mean 'not triggered by seccomp' ?
> You're not suggesting that this lsm has to enable seccomp to be functional?
> imo that's non starter due to overhead.

Yes, for now, it is triggered by a new seccomp filter return value
RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
be needed but could be useful to bind a seccomp filter security policy
with a Landlock one. Waiting for Kees's point of view…
Mickaël Salaün Aug. 27, 2016, 9:18 p.m. UTC | #22
On 27/08/2016 22:56, Alexei Starovoitov wrote:
> On Sat, Aug 27, 2016 at 09:55:01PM +0200, Mickaël Salaün wrote:
>>
>> On 27/08/2016 20:19, Alexei Starovoitov wrote:
>>> On Sat, Aug 27, 2016 at 04:34:55PM +0200, Mickaël Salaün wrote:
>>>>
>>>> On 27/08/2016 01:05, Alexei Starovoitov wrote:
>>>>> On Fri, Aug 26, 2016 at 05:10:40PM +0200, Mickaël Salaün wrote:
>>>>>
>>>>>>> As far as safety and type checking that bpf programs has to do,
>>>>>>> I like the approach of patch 06/10:
>>>>>>> +LANDLOCK_HOOK2(file_open, FILE_OPEN,
>>>>>>> +       PTR_TO_STRUCT_FILE, struct file *, file,
>>>>>>> +       PTR_TO_STRUCT_CRED, const struct cred *, cred
>>>>>>> +)
>>>>>>> teaching verifier to recognize struct file, cred, sockaddr
>>>>>>> will let bpf program access them naturally without any overhead.
>>>>>>> Though:
>>>>>>> @@ -102,6 +102,9 @@ enum bpf_prog_type {
>>>>>>>         BPF_PROG_TYPE_SCHED_CLS,
>>>>>>>         BPF_PROG_TYPE_SCHED_ACT,
>>>>>>>         BPF_PROG_TYPE_TRACEPOINT,
>>>>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_OPEN,
>>>>>>> +       BPF_PROG_TYPE_LANDLOCK_FILE_PERMISSION,
>>>>>>> +       BPF_PROG_TYPE_LANDLOCK_MMAP_FILE,
>>>>>>>  };
>>>>>>> is a bit of overkill.
>>>>>>> I think it would be cleaner to have single
>>>>>>> BPF_PROG_TYPE_LSM and at program load time pass
>>>>>>> lsm_hook_id as well, so that verifier can do safety checks
>>>>>>> based on type info provided in LANDLOCK_HOOKs
>>>>>>
>>>>>> I first started with a unique BPF_PROG_TYPE but, the thing is, the BPF
>>>>>> verifier check programs according to their types. If we need to check
>>>>>> specific context value types (e.g. PTR_TO_STRUCT_FILE), we need a
>>>>>> dedicated program types. I don't see any other way to do it with the
>>>>>> current verifier code. Moreover it's the purpose of program types, right?
>>>>>
>>>>> Adding new bpf program type for every lsm hook is not acceptable.
>>>>> Either do one new program type + pass lsm_hook_id as suggested
>>>>> or please come up with an alternative approach.
>>>>
>>>> OK, so we have to modify the verifier to not only rely on the program
>>>> type but on another value to check the context accesses. Do you have a
>>>> hint from where this value could come from? Do we need to add a new bpf
>>>> command to associate a program to a subtype?
>>>
>>> It's another field prog_subtype (or prog_hook_id) in union bpf_attr.
>>> Both prog_type and prog_hook_id are used during verification.
>>> prog_type distinguishes the main aspects whereas prog_hook_id selects
>>> which lsm_hook's argument definition to apply.
>>> At the time of attaching to a hook, the prog_hook_id passed at the
>>> load time should match lsm's hook_id.
>>
>> OK, so this new prog_subtype field should be use/set by a new bpf_cmd,
>> right? Something like BPF_PROG_SUBTYPE or BPF_PROG_METADATA?
> 
> No new command. It will be an optional field to existing BPF_PROG_LOAD.
> In other words instead of replicating everything that different
> bpf_prog_type-s need, we can pass this hook_id field to fine tune
> the purpose (and applicability) of the program.
> And one BPF_PROG_TYPE_LANDLOCK type will cover all lsm hooks.
> For example existing BPF_PROG_TYPE_TRACEPOINT checks the safety
> for all tracepoints while they all have different arguments.
> For tracepoints it's easier, since the only difference between them
> is the range of ctx access. Here we need strong type safety
> of arguments therefore need extra hook_id at load time.

OK, I will do it.
Andy Lutomirski Aug. 28, 2016, 8:13 a.m. UTC | #23
On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>
>
> On 27/08/2016 22:43, Alexei Starovoitov wrote:
> > On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
> >> On 27/08/2016 20:06, Alexei Starovoitov wrote:
> >>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> >>>> As said above, Landlock will not run an eBPF programs when not strictly
> >>>> needed. Attaching to a cgroup will have the same performance impact as
> >>>> attaching to a process hierarchy.
> >>>
> >>> Having a prog per cgroup per lsm_hook is the only scalable way I
> >>> could come up with. If you see another way, please propose.
> >>> current->seccomp.landlock_prog is not the answer.
> >>
> >> Hum, I don't see the difference from a performance point of view between
> >> a cgroup-based or a process hierarchy-based system.
> >>
> >> Maybe a better option should be to use an array of pointers with N
> >> entries, one for each supported hook, instead of a unique pointer list?
> >
> > yes, clearly array dereference is faster than link list walk.
> > Now the question is where to keep this prog_array[num_lsm_hooks] ?
> > Since we cannot keep it inside task_struct, we have to allocate it.
> > Every time the task is creted then. What to do on the fork? That
> > will require changes all over. Then the obvious optimization would be
> > to share this allocated array of prog pointers across multiple tasks...
> > and little by little this new facility will look like cgroup.
> > Hence the suggestion to put this array into cgroup from the start.
>
> I see your point :)
>
> >
> >> Anyway, being able to attach an LSM hook program to a cgroup thanks to
> >> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
> >> to use a process hierarchy). The downside will be to handle an LSM hook
> >> program which is not triggered by a seccomp-filter, but this should be
> >> needed anyway to handle interruptions.
> >
> > what do you mean 'not triggered by seccomp' ?
> > You're not suggesting that this lsm has to enable seccomp to be functional?
> > imo that's non starter due to overhead.
>
> Yes, for now, it is triggered by a new seccomp filter return value
> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
> be needed but could be useful to bind a seccomp filter security policy
> with a Landlock one. Waiting for Kees's point of view…
>

I'm not Kees, but I'd be okay with that.  I still think that doing
this by process hierarchy a la seccomp will be easier to use and to
understand (which is quite important for this kind of work) than doing
it by cgroup.

A feature I've wanted to add for a while is to have an fd that
represents a seccomp layer, the idea being that you would set up your
seccomp layer (with syscall filter, landlock hooks, etc) and then you
would have a syscall to install that layer.  Then an unprivileged
sandbox manager could set up its layer and still be able to inject new
processes into it later on, no cgroups needed.

--Andy
Andy Lutomirski Aug. 28, 2016, 8:14 a.m. UTC | #24
On Aug 27, 2016 8:12 PM, "Alexei Starovoitov"
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Aug 27, 2016 at 12:30:36AM -0700, Andy Lutomirski wrote:
> > > cgroup is the common way to group multiple tasks.
> > > Without cgroup only parent<->child relationship will be possible,
> > > which will limit usability of such lsm to a master task that controls
> > > its children. Such api restriction would have been ok, if we could
> > > extend it in the future, but unfortunately task-centric won't allow it
> > > without creating a parallel lsm that is cgroup based.
> > > Therefore I think we have to go with cgroup-centric api and your
> > > application has to use cgroups from the start though only parent-child
> > > would have been enough.
> > > Also I don't think the kernel can afford two bpf based lsm. One task
> > > based and another cgroup based, so we have to find common ground
> > > that suits both use cases.
> > > Having unprivliged access is a subset. There is no strong reason why
> > > cgroup+lsm+bpf should be limited to root only always.
> > > When we can guarantee no pointer leaks, we can allow unpriv.
> >
> > I don't really understand what you mean.  In the context of landlock,
> > which is a *sandbox*, can one of you explain a use case that
> > materially benefits from this type of cgroup usage?  I haven't thought
> > of one.
>
> In case of seccomp-like sandbox where parent controls child processes
> cgroup is not needed. It's needed when container management software
> needs to control a set of applications. If we can have one bpf-based lsm
> that works via cgroup and without, I'd be fine with it. Right now
> I haven't seen a plausible proposal to do that. Therefore cgroup based
> api is a common api that works for sandbox as well, though requiring
> parent to create a cgroup just to control a single child is cumbersome.
>

I don't believe that a common API can work to accomplish your goal.
For privileged container management, the manager is trusted.  For
unprivileged sandboxing, the manager is emphatically not trusted,
which means you need special rules like NO_NEW_PRIVS, and, unless you
want to start restricting setuid and such in some cgroups, you really
do need a different interface for joining the sandbox than whatever
the container manager is using.

What could make sense is to have one BPF-based LSM that supports both
a seccomp-like unprivileged interface and a cgroup-based privileged
interface.  Most of the code for it is the BPF part anyway -- all that
the cgroup or seccomp part needs to do is to figure out which BPF
program(s) to call.

Also, for container management software, you don't really need
everything tied to cgroup -- you just need a way to cleanly add new
processes to the same security context.
Mickaël Salaün Aug. 28, 2016, 9:42 a.m. UTC | #25
On 28/08/2016 10:13, Andy Lutomirski wrote:
> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>>
>>
>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>> attaching to a process hierarchy.
>>>>>
>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>> could come up with. If you see another way, please propose.
>>>>> current->seccomp.landlock_prog is not the answer.
>>>>
>>>> Hum, I don't see the difference from a performance point of view between
>>>> a cgroup-based or a process hierarchy-based system.
>>>>
>>>> Maybe a better option should be to use an array of pointers with N
>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>
>>> yes, clearly array dereference is faster than link list walk.
>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>> Every time the task is creted then. What to do on the fork? That
>>> will require changes all over. Then the obvious optimization would be
>>> to share this allocated array of prog pointers across multiple tasks...
>>> and little by little this new facility will look like cgroup.
>>> Hence the suggestion to put this array into cgroup from the start.
>>
>> I see your point :)
>>
>>>
>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>> program which is not triggered by a seccomp-filter, but this should be
>>>> needed anyway to handle interruptions.
>>>
>>> what do you mean 'not triggered by seccomp' ?
>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>> imo that's non starter due to overhead.
>>
>> Yes, for now, it is triggered by a new seccomp filter return value
>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>> be needed but could be useful to bind a seccomp filter security policy
>> with a Landlock one. Waiting for Kees's point of view…
>>
> 
> I'm not Kees, but I'd be okay with that.  I still think that doing
> this by process hierarchy a la seccomp will be easier to use and to
> understand (which is quite important for this kind of work) than doing
> it by cgroup.
> 
> A feature I've wanted to add for a while is to have an fd that
> represents a seccomp layer, the idea being that you would set up your
> seccomp layer (with syscall filter, landlock hooks, etc) and then you
> would have a syscall to install that layer.  Then an unprivileged
> sandbox manager could set up its layer and still be able to inject new
> processes into it later on, no cgroups needed.

A nice thing I didn't highlight about Landlock is that a process can
prepare a layer of rules (arraymap of handles + Landlock programs) and
pass the file descriptors of the Landlock programs to another process.
This process could then apply this programs to get sandboxed. However,
for now, because a Landlock program is only triggered by a seccomp
filter (which do not follow the Landlock programs as a FD), they will be
useless.

The FD referring to an arraymap of handles can also be used to update a
map and change the behavior of a Landlock program. A master process can
then add or remove restrictions to another process hierarchy on the fly.

However, I think it would make more sense to use cgroups if we want to
move an existing (unwilling) unsandoxed process into a sandboxed
environment. Of course, some more no_new_privs checks would be needed.
Andy Lutomirski Aug. 30, 2016, 6:55 p.m. UTC | #26
On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
>
>
> On 28/08/2016 10:13, Andy Lutomirski wrote:
>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>>>
>>>
>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>> attaching to a process hierarchy.
>>>>>>
>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>> could come up with. If you see another way, please propose.
>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>
>>>>> Hum, I don't see the difference from a performance point of view between
>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>
>>>>> Maybe a better option should be to use an array of pointers with N
>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>
>>>> yes, clearly array dereference is faster than link list walk.
>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>>> Every time the task is creted then. What to do on the fork? That
>>>> will require changes all over. Then the obvious optimization would be
>>>> to share this allocated array of prog pointers across multiple tasks...
>>>> and little by little this new facility will look like cgroup.
>>>> Hence the suggestion to put this array into cgroup from the start.
>>>
>>> I see your point :)
>>>
>>>>
>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>> needed anyway to handle interruptions.
>>>>
>>>> what do you mean 'not triggered by seccomp' ?
>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>>> imo that's non starter due to overhead.
>>>
>>> Yes, for now, it is triggered by a new seccomp filter return value
>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>> be needed but could be useful to bind a seccomp filter security policy
>>> with a Landlock one. Waiting for Kees's point of view…
>>>
>>
>> I'm not Kees, but I'd be okay with that.  I still think that doing
>> this by process hierarchy a la seccomp will be easier to use and to
>> understand (which is quite important for this kind of work) than doing
>> it by cgroup.
>>
>> A feature I've wanted to add for a while is to have an fd that
>> represents a seccomp layer, the idea being that you would set up your
>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
>> would have a syscall to install that layer.  Then an unprivileged
>> sandbox manager could set up its layer and still be able to inject new
>> processes into it later on, no cgroups needed.
>
> A nice thing I didn't highlight about Landlock is that a process can
> prepare a layer of rules (arraymap of handles + Landlock programs) and
> pass the file descriptors of the Landlock programs to another process.
> This process could then apply this programs to get sandboxed. However,
> for now, because a Landlock program is only triggered by a seccomp
> filter (which do not follow the Landlock programs as a FD), they will be
> useless.
>
> The FD referring to an arraymap of handles can also be used to update a
> map and change the behavior of a Landlock program. A master process can
> then add or remove restrictions to another process hierarchy on the fly.

Maybe this could be extended a little bit.  The fd could hold the
seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
the ability to install it and FMODE_WRITE could give the ability to
modify it.
Mickaël Salaün Aug. 30, 2016, 8:20 p.m. UTC | #27
On 30/08/2016 20:55, Andy Lutomirski wrote:
> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>>
>> On 28/08/2016 10:13, Andy Lutomirski wrote:
>>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>>>>
>>>>
>>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>>> attaching to a process hierarchy.
>>>>>>>
>>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>>> could come up with. If you see another way, please propose.
>>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>>
>>>>>> Hum, I don't see the difference from a performance point of view between
>>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>>
>>>>>> Maybe a better option should be to use an array of pointers with N
>>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>>
>>>>> yes, clearly array dereference is faster than link list walk.
>>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>>>> Every time the task is creted then. What to do on the fork? That
>>>>> will require changes all over. Then the obvious optimization would be
>>>>> to share this allocated array of prog pointers across multiple tasks...
>>>>> and little by little this new facility will look like cgroup.
>>>>> Hence the suggestion to put this array into cgroup from the start.
>>>>
>>>> I see your point :)
>>>>
>>>>>
>>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>>> needed anyway to handle interruptions.
>>>>>
>>>>> what do you mean 'not triggered by seccomp' ?
>>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>>>> imo that's non starter due to overhead.
>>>>
>>>> Yes, for now, it is triggered by a new seccomp filter return value
>>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>>> be needed but could be useful to bind a seccomp filter security policy
>>>> with a Landlock one. Waiting for Kees's point of view…
>>>>
>>>
>>> I'm not Kees, but I'd be okay with that.  I still think that doing
>>> this by process hierarchy a la seccomp will be easier to use and to
>>> understand (which is quite important for this kind of work) than doing
>>> it by cgroup.
>>>
>>> A feature I've wanted to add for a while is to have an fd that
>>> represents a seccomp layer, the idea being that you would set up your
>>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
>>> would have a syscall to install that layer.  Then an unprivileged
>>> sandbox manager could set up its layer and still be able to inject new
>>> processes into it later on, no cgroups needed.
>>
>> A nice thing I didn't highlight about Landlock is that a process can
>> prepare a layer of rules (arraymap of handles + Landlock programs) and
>> pass the file descriptors of the Landlock programs to another process.
>> This process could then apply this programs to get sandboxed. However,
>> for now, because a Landlock program is only triggered by a seccomp
>> filter (which do not follow the Landlock programs as a FD), they will be
>> useless.
>>
>> The FD referring to an arraymap of handles can also be used to update a
>> map and change the behavior of a Landlock program. A master process can
>> then add or remove restrictions to another process hierarchy on the fly.
> 
> Maybe this could be extended a little bit.  The fd could hold the
> seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
> the ability to install it and FMODE_WRITE could give the ability to
> modify it.
> 

This is interesting! It should be possible to append the seccomp stack
of a source process to the seccomp stack of the target process when a
Landlock program is passed and then activated through seccomp(2).

For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage
permission of the eBPF program FD in a specific way?
Andy Lutomirski Aug. 30, 2016, 8:23 p.m. UTC | #28
On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 30/08/2016 20:55, Andy Lutomirski wrote:
>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>
>>>
>>> On 28/08/2016 10:13, Andy Lutomirski wrote:
>>>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>>>>>
>>>>>
>>>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>>>> attaching to a process hierarchy.
>>>>>>>>
>>>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>>>> could come up with. If you see another way, please propose.
>>>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>>>
>>>>>>> Hum, I don't see the difference from a performance point of view between
>>>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>>>
>>>>>>> Maybe a better option should be to use an array of pointers with N
>>>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>>>
>>>>>> yes, clearly array dereference is faster than link list walk.
>>>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>>>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>>>>> Every time the task is creted then. What to do on the fork? That
>>>>>> will require changes all over. Then the obvious optimization would be
>>>>>> to share this allocated array of prog pointers across multiple tasks...
>>>>>> and little by little this new facility will look like cgroup.
>>>>>> Hence the suggestion to put this array into cgroup from the start.
>>>>>
>>>>> I see your point :)
>>>>>
>>>>>>
>>>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>>>> needed anyway to handle interruptions.
>>>>>>
>>>>>> what do you mean 'not triggered by seccomp' ?
>>>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>>>>> imo that's non starter due to overhead.
>>>>>
>>>>> Yes, for now, it is triggered by a new seccomp filter return value
>>>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>>>> be needed but could be useful to bind a seccomp filter security policy
>>>>> with a Landlock one. Waiting for Kees's point of view…
>>>>>
>>>>
>>>> I'm not Kees, but I'd be okay with that.  I still think that doing
>>>> this by process hierarchy a la seccomp will be easier to use and to
>>>> understand (which is quite important for this kind of work) than doing
>>>> it by cgroup.
>>>>
>>>> A feature I've wanted to add for a while is to have an fd that
>>>> represents a seccomp layer, the idea being that you would set up your
>>>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
>>>> would have a syscall to install that layer.  Then an unprivileged
>>>> sandbox manager could set up its layer and still be able to inject new
>>>> processes into it later on, no cgroups needed.
>>>
>>> A nice thing I didn't highlight about Landlock is that a process can
>>> prepare a layer of rules (arraymap of handles + Landlock programs) and
>>> pass the file descriptors of the Landlock programs to another process.
>>> This process could then apply this programs to get sandboxed. However,
>>> for now, because a Landlock program is only triggered by a seccomp
>>> filter (which do not follow the Landlock programs as a FD), they will be
>>> useless.
>>>
>>> The FD referring to an arraymap of handles can also be used to update a
>>> map and change the behavior of a Landlock program. A master process can
>>> then add or remove restrictions to another process hierarchy on the fly.
>>
>> Maybe this could be extended a little bit.  The fd could hold the
>> seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
>> the ability to install it and FMODE_WRITE could give the ability to
>> modify it.
>>
>
> This is interesting! It should be possible to append the seccomp stack
> of a source process to the seccomp stack of the target process when a
> Landlock program is passed and then activated through seccomp(2).
>
> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage
> permission of the eBPF program FD in a specific way?
>

This wouldn't be an eBPF program FD -- it would be an FD encapsulating
an entire configuration including seccomp BPF program, whatever
landlock stuff is associated, and eventual seccomp monitor
configuration (once I write that code), etc.

You wouldn't say "attach this process's seccomp stack to me" -- you'd
say "attach this seccomp layer to me".

A decision that we'd have to make would be whether the FD links to the
parent layer or whether it can be attached without regard to what the
parent layer is.

--Andy
Mickaël Salaün Aug. 30, 2016, 8:33 p.m. UTC | #29
On 30/08/2016 22:23, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 30/08/2016 20:55, Andy Lutomirski wrote:
>>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>>
>>>> On 28/08/2016 10:13, Andy Lutomirski wrote:
>>>>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
>>>>>>
>>>>>>
>>>>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
>>>>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>>>>> attaching to a process hierarchy.
>>>>>>>>>
>>>>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>>>>> could come up with. If you see another way, please propose.
>>>>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>>>>
>>>>>>>> Hum, I don't see the difference from a performance point of view between
>>>>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>>>>
>>>>>>>> Maybe a better option should be to use an array of pointers with N
>>>>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>>>>
>>>>>>> yes, clearly array dereference is faster than link list walk.
>>>>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>>>>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>>>>>> Every time the task is creted then. What to do on the fork? That
>>>>>>> will require changes all over. Then the obvious optimization would be
>>>>>>> to share this allocated array of prog pointers across multiple tasks...
>>>>>>> and little by little this new facility will look like cgroup.
>>>>>>> Hence the suggestion to put this array into cgroup from the start.
>>>>>>
>>>>>> I see your point :)
>>>>>>
>>>>>>>
>>>>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>>>>> needed anyway to handle interruptions.
>>>>>>>
>>>>>>> what do you mean 'not triggered by seccomp' ?
>>>>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>>>>>> imo that's non starter due to overhead.
>>>>>>
>>>>>> Yes, for now, it is triggered by a new seccomp filter return value
>>>>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>>>>> be needed but could be useful to bind a seccomp filter security policy
>>>>>> with a Landlock one. Waiting for Kees's point of view…
>>>>>>
>>>>>
>>>>> I'm not Kees, but I'd be okay with that.  I still think that doing
>>>>> this by process hierarchy a la seccomp will be easier to use and to
>>>>> understand (which is quite important for this kind of work) than doing
>>>>> it by cgroup.
>>>>>
>>>>> A feature I've wanted to add for a while is to have an fd that
>>>>> represents a seccomp layer, the idea being that you would set up your
>>>>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
>>>>> would have a syscall to install that layer.  Then an unprivileged
>>>>> sandbox manager could set up its layer and still be able to inject new
>>>>> processes into it later on, no cgroups needed.
>>>>
>>>> A nice thing I didn't highlight about Landlock is that a process can
>>>> prepare a layer of rules (arraymap of handles + Landlock programs) and
>>>> pass the file descriptors of the Landlock programs to another process.
>>>> This process could then apply this programs to get sandboxed. However,
>>>> for now, because a Landlock program is only triggered by a seccomp
>>>> filter (which do not follow the Landlock programs as a FD), they will be
>>>> useless.
>>>>
>>>> The FD referring to an arraymap of handles can also be used to update a
>>>> map and change the behavior of a Landlock program. A master process can
>>>> then add or remove restrictions to another process hierarchy on the fly.
>>>
>>> Maybe this could be extended a little bit.  The fd could hold the
>>> seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
>>> the ability to install it and FMODE_WRITE could give the ability to
>>> modify it.
>>>
>>
>> This is interesting! It should be possible to append the seccomp stack
>> of a source process to the seccomp stack of the target process when a
>> Landlock program is passed and then activated through seccomp(2).
>>
>> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage
>> permission of the eBPF program FD in a specific way?
>>
> 
> This wouldn't be an eBPF program FD -- it would be an FD encapsulating
> an entire configuration including seccomp BPF program, whatever
> landlock stuff is associated, and eventual seccomp monitor
> configuration (once I write that code), etc.
> 
> You wouldn't say "attach this process's seccomp stack to me" -- you'd
> say "attach this seccomp layer to me".
> 
> A decision that we'd have to make would be whether the FD links to the
> parent layer or whether it can be attached without regard to what the
> parent layer is.

OK, I like that, but I think it could be done on a second time. :)
Alexei Starovoitov Aug. 30, 2016, 8:55 p.m. UTC | #30
On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote:
> 
> 
> On 30/08/2016 22:23, Andy Lutomirski wrote:
> > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün <mic@digikod.net> wrote:
> >>
> >> On 30/08/2016 20:55, Andy Lutomirski wrote:
> >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
> >>>>
> >>>>
> >>>> On 28/08/2016 10:13, Andy Lutomirski wrote:
> >>>>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
> >>>>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
> >>>>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
> >>>>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> >>>>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
> >>>>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
> >>>>>>>>>> attaching to a process hierarchy.
> >>>>>>>>>
> >>>>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
> >>>>>>>>> could come up with. If you see another way, please propose.
> >>>>>>>>> current->seccomp.landlock_prog is not the answer.
> >>>>>>>>
> >>>>>>>> Hum, I don't see the difference from a performance point of view between
> >>>>>>>> a cgroup-based or a process hierarchy-based system.
> >>>>>>>>
> >>>>>>>> Maybe a better option should be to use an array of pointers with N
> >>>>>>>> entries, one for each supported hook, instead of a unique pointer list?
> >>>>>>>
> >>>>>>> yes, clearly array dereference is faster than link list walk.
> >>>>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
> >>>>>>> Since we cannot keep it inside task_struct, we have to allocate it.
> >>>>>>> Every time the task is creted then. What to do on the fork? That
> >>>>>>> will require changes all over. Then the obvious optimization would be
> >>>>>>> to share this allocated array of prog pointers across multiple tasks...
> >>>>>>> and little by little this new facility will look like cgroup.
> >>>>>>> Hence the suggestion to put this array into cgroup from the start.
> >>>>>>
> >>>>>> I see your point :)
> >>>>>>
> >>>>>>>
> >>>>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
> >>>>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
> >>>>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
> >>>>>>>> program which is not triggered by a seccomp-filter, but this should be
> >>>>>>>> needed anyway to handle interruptions.
> >>>>>>>
> >>>>>>> what do you mean 'not triggered by seccomp' ?
> >>>>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
> >>>>>>> imo that's non starter due to overhead.
> >>>>>>
> >>>>>> Yes, for now, it is triggered by a new seccomp filter return value
> >>>>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
> >>>>>> be needed but could be useful to bind a seccomp filter security policy
> >>>>>> with a Landlock one. Waiting for Kees's point of view…
> >>>>>>
> >>>>>
> >>>>> I'm not Kees, but I'd be okay with that.  I still think that doing
> >>>>> this by process hierarchy a la seccomp will be easier to use and to
> >>>>> understand (which is quite important for this kind of work) than doing
> >>>>> it by cgroup.
> >>>>>
> >>>>> A feature I've wanted to add for a while is to have an fd that
> >>>>> represents a seccomp layer, the idea being that you would set up your
> >>>>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
> >>>>> would have a syscall to install that layer.  Then an unprivileged
> >>>>> sandbox manager could set up its layer and still be able to inject new
> >>>>> processes into it later on, no cgroups needed.
> >>>>
> >>>> A nice thing I didn't highlight about Landlock is that a process can
> >>>> prepare a layer of rules (arraymap of handles + Landlock programs) and
> >>>> pass the file descriptors of the Landlock programs to another process.
> >>>> This process could then apply this programs to get sandboxed. However,
> >>>> for now, because a Landlock program is only triggered by a seccomp
> >>>> filter (which do not follow the Landlock programs as a FD), they will be
> >>>> useless.
> >>>>
> >>>> The FD referring to an arraymap of handles can also be used to update a
> >>>> map and change the behavior of a Landlock program. A master process can
> >>>> then add or remove restrictions to another process hierarchy on the fly.
> >>>
> >>> Maybe this could be extended a little bit.  The fd could hold the
> >>> seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
> >>> the ability to install it and FMODE_WRITE could give the ability to
> >>> modify it.
> >>>
> >>
> >> This is interesting! It should be possible to append the seccomp stack
> >> of a source process to the seccomp stack of the target process when a
> >> Landlock program is passed and then activated through seccomp(2).
> >>
> >> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage
> >> permission of the eBPF program FD in a specific way?
> >>
> > 
> > This wouldn't be an eBPF program FD -- it would be an FD encapsulating
> > an entire configuration including seccomp BPF program, whatever
> > landlock stuff is associated, and eventual seccomp monitor
> > configuration (once I write that code), etc.
> > 
> > You wouldn't say "attach this process's seccomp stack to me" -- you'd
> > say "attach this seccomp layer to me".
> > 
> > A decision that we'd have to make would be whether the FD links to the
> > parent layer or whether it can be attached without regard to what the
> > parent layer is.
> 
> OK, I like that, but I think it could be done on a second time. :)

I don't. Single FD that is a collection of objects seems an odd abstraction
to me. I also don't see what it actually solves.
I think lsm and seccomp should be orthogonal and not tied into each other.
Andy Lutomirski Aug. 30, 2016, 9:45 p.m. UTC | #31
On Aug 30, 2016 1:56 PM, "Alexei Starovoitov"
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote:
> >
> >
> > On 30/08/2016 22:23, Andy Lutomirski wrote:
> > > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün <mic@digikod.net> wrote:
> > >>
> > >> On 30/08/2016 20:55, Andy Lutomirski wrote:
> > >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün <mic@digikod.net> wrote:
> > >>>>
> > >>>>
> > >>>> On 28/08/2016 10:13, Andy Lutomirski wrote:
> > >>>>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" <mic@digikod.net> wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
> > >>>>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
> > >>>>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
> > >>>>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> > >>>>>>>>>> As said above, Landlock will not run an eBPF programs when not strictly
> > >>>>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
> > >>>>>>>>>> attaching to a process hierarchy.
> > >>>>>>>>>
> > >>>>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
> > >>>>>>>>> could come up with. If you see another way, please propose.
> > >>>>>>>>> current->seccomp.landlock_prog is not the answer.
> > >>>>>>>>
> > >>>>>>>> Hum, I don't see the difference from a performance point of view between
> > >>>>>>>> a cgroup-based or a process hierarchy-based system.
> > >>>>>>>>
> > >>>>>>>> Maybe a better option should be to use an array of pointers with N
> > >>>>>>>> entries, one for each supported hook, instead of a unique pointer list?
> > >>>>>>>
> > >>>>>>> yes, clearly array dereference is faster than link list walk.
> > >>>>>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
> > >>>>>>> Since we cannot keep it inside task_struct, we have to allocate it.
> > >>>>>>> Every time the task is creted then. What to do on the fork? That
> > >>>>>>> will require changes all over. Then the obvious optimization would be
> > >>>>>>> to share this allocated array of prog pointers across multiple tasks...
> > >>>>>>> and little by little this new facility will look like cgroup.
> > >>>>>>> Hence the suggestion to put this array into cgroup from the start.
> > >>>>>>
> > >>>>>> I see your point :)
> > >>>>>>
> > >>>>>>>
> > >>>>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
> > >>>>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
> > >>>>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
> > >>>>>>>> program which is not triggered by a seccomp-filter, but this should be
> > >>>>>>>> needed anyway to handle interruptions.
> > >>>>>>>
> > >>>>>>> what do you mean 'not triggered by seccomp' ?
> > >>>>>>> You're not suggesting that this lsm has to enable seccomp to be functional?
> > >>>>>>> imo that's non starter due to overhead.
> > >>>>>>
> > >>>>>> Yes, for now, it is triggered by a new seccomp filter return value
> > >>>>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
> > >>>>>> be needed but could be useful to bind a seccomp filter security policy
> > >>>>>> with a Landlock one. Waiting for Kees's point of view…
> > >>>>>>
> > >>>>>
> > >>>>> I'm not Kees, but I'd be okay with that.  I still think that doing
> > >>>>> this by process hierarchy a la seccomp will be easier to use and to
> > >>>>> understand (which is quite important for this kind of work) than doing
> > >>>>> it by cgroup.
> > >>>>>
> > >>>>> A feature I've wanted to add for a while is to have an fd that
> > >>>>> represents a seccomp layer, the idea being that you would set up your
> > >>>>> seccomp layer (with syscall filter, landlock hooks, etc) and then you
> > >>>>> would have a syscall to install that layer.  Then an unprivileged
> > >>>>> sandbox manager could set up its layer and still be able to inject new
> > >>>>> processes into it later on, no cgroups needed.
> > >>>>
> > >>>> A nice thing I didn't highlight about Landlock is that a process can
> > >>>> prepare a layer of rules (arraymap of handles + Landlock programs) and
> > >>>> pass the file descriptors of the Landlock programs to another process.
> > >>>> This process could then apply this programs to get sandboxed. However,
> > >>>> for now, because a Landlock program is only triggered by a seccomp
> > >>>> filter (which do not follow the Landlock programs as a FD), they will be
> > >>>> useless.
> > >>>>
> > >>>> The FD referring to an arraymap of handles can also be used to update a
> > >>>> map and change the behavior of a Landlock program. A master process can
> > >>>> then add or remove restrictions to another process hierarchy on the fly.
> > >>>
> > >>> Maybe this could be extended a little bit.  The fd could hold the
> > >>> seccomp filter *and* the LSM hook filters.  FMODE_EXECUTE could give
> > >>> the ability to install it and FMODE_WRITE could give the ability to
> > >>> modify it.
> > >>>
> > >>
> > >> This is interesting! It should be possible to append the seccomp stack
> > >> of a source process to the seccomp stack of the target process when a
> > >> Landlock program is passed and then activated through seccomp(2).
> > >>
> > >> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage
> > >> permission of the eBPF program FD in a specific way?
> > >>
> > >
> > > This wouldn't be an eBPF program FD -- it would be an FD encapsulating
> > > an entire configuration including seccomp BPF program, whatever
> > > landlock stuff is associated, and eventual seccomp monitor
> > > configuration (once I write that code), etc.
> > >
> > > You wouldn't say "attach this process's seccomp stack to me" -- you'd
> > > say "attach this seccomp layer to me".
> > >
> > > A decision that we'd have to make would be whether the FD links to the
> > > parent layer or whether it can be attached without regard to what the
> > > parent layer is.
> >
> > OK, I like that, but I think it could be done on a second time. :)
>
> I don't. Single FD that is a collection of objects seems an odd abstraction
> to me. I also don't see what it actually solves.
> I think lsm and seccomp should be orthogonal and not tied into each other.
>

It's not a random collection of objects.  It's a fully configured
sandboxing layer.

One might argue that landlock shouldn't be tied to seccomp (in theory,
attached progs could be given access to syscall_get_xyz()), but I
think that the seccomp attachment mechanism is the right way to
install unprivileged filters.  It handles the no_new_privs stuff, it
allows TSYNC, it's totally independent of systemwide policy, etc.

Trying to use cgroups or similar for this is going to be much nastier.
Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
putting cgroupfs in their containers, so requiring cgroups or similar
would be a mess for that type of application.

--Andy
Alexei Starovoitov Aug. 31, 2016, 1:36 a.m. UTC | #32
On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
> 
> One might argue that landlock shouldn't be tied to seccomp (in theory,
> attached progs could be given access to syscall_get_xyz()), but I

proposed lsm is way more powerful than syscall_get_xyz.
no need to dumb it down.

> think that the seccomp attachment mechanism is the right way to
> install unprivileged filters.  It handles the no_new_privs stuff, it
> allows TSYNC, it's totally independent of systemwide policy, etc.
> 
> Trying to use cgroups or similar for this is going to be much nastier.
> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
> putting cgroupfs in their containers, so requiring cgroups or similar
> would be a mess for that type of application.

I don't see why it is a 'mess'. cgroups are already used by majority
of the systems, so I don't see why requiring a cgroup is such a big deal.
But let's say we don't do them. How implementation is going to look like
for task based hierarchy? Note that we need an array of bpf_prog pointers.
One for each lsm hook. Where this array is going to be stored?
We cannot put in task_struct, since it's too large. Cannot put it
into 'struct seccomp' directly either, unless it will become a pointer.
Is that the proposal?
So now we will be wasting extra 1kbyte of memory per task. Not great.
We'd want to optimize it by sharing this such struct seccomp with prog array
across threads of the same task? Or dynimically allocating it when
landlock is in use? May sound nice, but how to account for that kernel
memory? I guess also solvable by charging memlock.
With cgroup based approach we don't need to worry about all that.
Andy Lutomirski Aug. 31, 2016, 3:29 a.m. UTC | #33
On Tue, Aug 30, 2016 at 6:36 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
>>
>> One might argue that landlock shouldn't be tied to seccomp (in theory,
>> attached progs could be given access to syscall_get_xyz()), but I
>
> proposed lsm is way more powerful than syscall_get_xyz.
> no need to dumb it down.

I think you're misunderstanding me.

Mickaël's code allows one to make the LSM hook filters depend on the
syscall using SECCOMP_RET_LANDLOCK.  I'm suggesting that a similar
effect could be achieved by allowing the eBPF LSM hook to call
syscall_get_xyz() if it wants to.

>
>> think that the seccomp attachment mechanism is the right way to
>> install unprivileged filters.  It handles the no_new_privs stuff, it
>> allows TSYNC, it's totally independent of systemwide policy, etc.
>>
>> Trying to use cgroups or similar for this is going to be much nastier.
>> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
>> putting cgroupfs in their containers, so requiring cgroups or similar
>> would be a mess for that type of application.
>
> I don't see why it is a 'mess'. cgroups are already used by majority
> of the systems, so I don't see why requiring a cgroup is such a big deal.

Requiring cgroup to be configured in isn't a big deal.  Requiring

> But let's say we don't do them. How implementation is going to look like
> for task based hierarchy? Note that we need an array of bpf_prog pointers.
> One for each lsm hook. Where this array is going to be stored?
> We cannot put in task_struct, since it's too large. Cannot put it
> into 'struct seccomp' directly either, unless it will become a pointer.
> Is that the proposal?

It would go in struct seccomp_filter or in something pointed to from there.

> So now we will be wasting extra 1kbyte of memory per task. Not great.
> We'd want to optimize it by sharing this such struct seccomp with prog array
> across threads of the same task? Or dynimically allocating it when
> landlock is in use? May sound nice, but how to account for that kernel
> memory? I guess also solvable by charging memlock.
> With cgroup based approach we don't need to worry about all that.
>

The considerations are essentially identical either way.

With cgroups, if you want to share the memory between multiple
separate sandboxes (Firejail instances, Sandstorm grains, Chromium
instances, xdg-apps, etc), you'd need to get them to all coordinate to
share a cgroup.  With a seccomp-like interface, you'd need to get them
to coordinate to share an installed layer (using my FD idea or
similar).

There would *not* be any duplication of this memory just because a
sandboxed process called fork().

--Andy
diff mbox

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 79014aedbea4..9e6786e7a40a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -14,6 +14,9 @@ 
 
 #ifdef CONFIG_SECURITY_LANDLOCK
 #include <linux/fs.h> /* struct file */
+#ifdef CONFIG_CGROUPS
+#include <linux/cgroup-defs.h> /* struct cgroup_subsys_state */
+#endif	/* CONFIG_CGROUPS */
 #endif /* CONFIG_SECURITY_LANDLOCK */
 
 struct bpf_map;
@@ -85,6 +88,7 @@  enum bpf_arg_type {
 	ARG_PTR_TO_STRUCT_FILE,		/* pointer to struct file */
 	ARG_PTR_TO_STRUCT_CRED,		/* pointer to struct cred */
 	ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS,	/* pointer to Landlock FS handle */
+	ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP,	/* pointer to Landlock cgroup handle */
 };
 
 /* type of values returned from helper functions */
@@ -148,6 +152,7 @@  enum bpf_reg_type {
 	PTR_TO_STRUCT_FILE,
 	PTR_TO_STRUCT_CRED,
 	CONST_PTR_TO_LANDLOCK_HANDLE_FS,
+	CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP,
 };
 
 struct bpf_prog;
@@ -212,6 +217,9 @@  struct map_landlock_handle {
 	u32 type; /* e.g. BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD */
 	union {
 		struct file *file;
+#ifdef CONFIG_CGROUPS
+		struct cgroup_subsys_state *css;
+#endif	/* CONFIG_CGROUPS */
 	};
 };
 #endif /* CONFIG_SECURITY_LANDLOCK */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 88af79dd668c..7f60b9fdb35c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -90,12 +90,14 @@  enum bpf_map_type {
 enum bpf_map_array_type {
 	BPF_MAP_ARRAY_TYPE_UNSPEC,
 	BPF_MAP_ARRAY_TYPE_LANDLOCK_FS,
+	BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP,
 };
 
 enum bpf_map_handle_type {
 	BPF_MAP_HANDLE_TYPE_UNSPEC,
 	BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD,
 	BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_GLOB,
+	BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD,
 };
 
 enum bpf_map_array_op {
@@ -364,6 +366,19 @@  enum bpf_func_id {
 	 */
 	BPF_FUNC_landlock_cmp_fs_beneath_with_struct_file,
 
+	/**
+	 * bpf_landlock_cmp_cgroup_beneath(opt, map, map_op)
+	 * Check if the current process is a leaf of cgroup handles
+	 *
+	 * @opt: check options (e.g. LANDLOCK_FLAG_OPT_REVERSE)
+	 * @map: handles to compare against
+	 * @map_op: which elements of the map to use (e.g. BPF_MAP_ARRAY_OP_OR)
+	 *
+	 * Return: 0 if the current cgroup is the sam or beneath the handle,
+	 * 1 otherwise, or a negative value if an error occurred.
+	 */
+	BPF_FUNC_landlock_cmp_cgroup_beneath,
+
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 6804dafd8355..050b3d8d88c8 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -19,6 +19,12 @@ 
 #include <linux/file.h> /* fput() */
 #include <linux/fs.h> /* struct file */
 
+#ifdef CONFIG_SECURITY_LANDLOCK
+#ifdef CONFIG_CGROUPS
+#include <linux/cgroup-defs.h> /* struct cgroup_subsys_state */
+#endif	/* CONFIG_CGROUPS */
+#endif /* CONFIG_SECURITY_LANDLOCK */
+
 static void bpf_array_free_percpu(struct bpf_array *array)
 {
 	int i;
@@ -514,6 +520,12 @@  static void landlock_put_handle(struct map_landlock_handle *handle)
 		else
 			WARN_ON(1);
 		break;
+	case BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD:
+		if (likely(handle->css))
+			css_put(handle->css);
+		else
+			WARN_ON(1);
+		break;
 	default:
 		WARN_ON(1);
 	}
@@ -541,6 +553,10 @@  static enum bpf_map_array_type landlock_get_array_type(
 	case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD:
 	case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_GLOB:
 		return BPF_MAP_ARRAY_TYPE_LANDLOCK_FS;
+	case BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD:
+#ifdef CONFIG_CGROUPS
+		return BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP;
+#endif	/* CONFIG_CGROUPS */
 	case BPF_MAP_HANDLE_TYPE_UNSPEC:
 	default:
 		return -EINVAL;
@@ -557,6 +573,9 @@  static inline long landlock_store_handle(struct map_landlock_handle *dst,
 		struct landlock_handle *khandle)
 {
 	struct path kpath;
+#ifdef CONFIG_CGROUPS
+	struct cgroup_subsys_state *css;
+#endif	/* CONFIG_CGROUPS */
 	struct file *handle_file;
 
 	if (unlikely(!khandle))
@@ -569,6 +588,17 @@  static inline long landlock_store_handle(struct map_landlock_handle *dst,
 		FGET_OR_RET(handle_file, khandle->fd);
 		dst->file = handle_file;
 		break;
+#ifdef CONFIG_CGROUPS
+	case BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD:
+		FGET_OR_RET(handle_file, khandle->fd);
+		css = css_tryget_online_from_dir(file_dentry(handle_file), NULL);
+		fput(handle_file);
+		/* NULL css check done by css_tryget_online_from_dir() */
+		if (IS_ERR(css))
+			return PTR_ERR(css);
+		dst->css = css;
+		break;
+#endif	/* CONFIG_CGROUPS */
 	default:
 		WARN_ON(1);
 		path_put(&kpath);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b182c88d5c13..b4e5c3bbc520 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -247,6 +247,7 @@  static const char * const reg_type_str[] = {
 	[PTR_TO_STRUCT_FILE]	= "struct_file",
 	[PTR_TO_STRUCT_CRED]	= "struct_cred",
 	[CONST_PTR_TO_LANDLOCK_HANDLE_FS] = "landlock_handle_fs",
+	[CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP] = "landlock_handle_cgroup",
 };
 
 static void print_verifier_state(struct verifier_state *state)
@@ -560,6 +561,7 @@  static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_STRUCT_FILE:
 	case PTR_TO_STRUCT_CRED:
 	case CONST_PTR_TO_LANDLOCK_HANDLE_FS:
+	case CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP:
 		return true;
 	default:
 		return false;
@@ -955,6 +957,8 @@  static int check_func_arg(struct verifier_env *env, u32 regno,
 		expected_type = PTR_TO_STRUCT_CRED;
 	} else if (arg_type == ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS) {
 		expected_type = CONST_PTR_TO_LANDLOCK_HANDLE_FS;
+	} else if (arg_type == ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP) {
+		expected_type = CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP;
 	} else if (arg_type == ARG_PTR_TO_STACK ||
 		   arg_type == ARG_PTR_TO_RAW_STACK) {
 		expected_type = PTR_TO_STACK;
@@ -1733,6 +1737,8 @@  static inline enum bpf_reg_type bpf_reg_type_from_map(struct bpf_map *map)
 	switch (map->map_array_type) {
 	case BPF_MAP_ARRAY_TYPE_LANDLOCK_FS:
 		return CONST_PTR_TO_LANDLOCK_HANDLE_FS;
+	case BPF_MAP_ARRAY_TYPE_LANDLOCK_CGROUP:
+		return CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP;
 	case BPF_MAP_ARRAY_TYPE_UNSPEC:
 	default:
 		return CONST_PTR_TO_MAP;
diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
index dc8328d216d7..414eb047e50e 100644
--- a/security/landlock/Kconfig
+++ b/security/landlock/Kconfig
@@ -10,6 +10,9 @@  config SECURITY_LANDLOCK
 	  of stacked eBPF programs for some LSM hooks. Each program can do some
 	  access comparison to check if an access request is legitimate.
 
+	  It is recommended to enable cgroups to be able to match a policy
+	  according to a group of processes.
+
 	  Further information about eBPF can be found in
 	  Documentation/networking/filter.txt
 
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 27f359a8cfaa..cdaaa152b849 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,3 +1,3 @@ 
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
-landlock-y := lsm.o checker_fs.o
+landlock-y := lsm.o checker_fs.o checker_cgroup.o
diff --git a/security/landlock/checker_cgroup.c b/security/landlock/checker_cgroup.c
new file mode 100644
index 000000000000..97f29ac64188
--- /dev/null
+++ b/security/landlock/checker_cgroup.c
@@ -0,0 +1,96 @@ 
+/*
+ * Landlock LSM - cgroup Checkers
+ *
+ * Copyright (C) 2016  Mickaël Salaün <mic@digikod.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifdef CONFIG_CGROUPS
+
+#include <asm/current.h>
+#include <linux/bpf.h> /* enum bpf_map_array_op */
+#include <linux/cgroup-defs.h> /* struct cgroup_subsys_state */
+#include <linux/cgroup.h> /* cgroup_is_descendant(), task_css_set() */
+#include <linux/errno.h>
+
+#include "checker_cgroup.h"
+
+
+/*
+ * bpf_landlock_cmp_cgroup_beneath
+ *
+ * Cf. include/uapi/linux/bpf.h
+ */
+static inline u64 bpf_landlock_cmp_cgroup_beneath(u64 r1_option, u64 r2_map,
+		u64 r3_map_op, u64 r4, u64 r5)
+{
+	u8 option = (u8) r1_option;
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
+	enum bpf_map_array_op map_op = r3_map_op;
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct cgroup *cg1, *cg2;
+	struct map_landlock_handle *handle;
+	int i;
+
+	/* ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP is an arraymap */
+	if (unlikely(!map)) {
+		WARN_ON(1);
+		return -EFAULT;
+	}
+	if (unlikely((option | _LANDLOCK_FLAG_OPT_MASK) != _LANDLOCK_FLAG_OPT_MASK))
+		return -EINVAL;
+
+	/* for now, only handle OP_OR */
+	switch (map_op) {
+	case BPF_MAP_ARRAY_OP_OR:
+		break;
+	case BPF_MAP_ARRAY_OP_UNSPEC:
+	case BPF_MAP_ARRAY_OP_AND:
+	case BPF_MAP_ARRAY_OP_XOR:
+	default:
+		return -EINVAL;
+	}
+
+	synchronize_rcu();
+
+	for (i = 0; i < array->n_entries; i++) {
+		handle = (struct map_landlock_handle *)
+				(array->value + array->elem_size * i);
+
+		/* protected by the proto types, should not happen */
+		if (unlikely(handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_CGROUP_FD)) {
+			WARN_ON(1);
+			return -EFAULT;
+		}
+		if (unlikely(!handle->css)) {
+			WARN_ON(1);
+			return -EFAULT;
+		}
+
+		if (option & LANDLOCK_FLAG_OPT_REVERSE) {
+			cg1 = handle->css->cgroup;
+			cg2 = task_css_set(current)->dfl_cgrp;
+		} else {
+			cg1 = task_css_set(current)->dfl_cgrp;
+			cg2 = handle->css->cgroup;
+		}
+
+		if (cgroup_is_descendant(cg1, cg2))
+			return 0;
+	}
+	return 1;
+}
+
+const struct bpf_func_proto bpf_landlock_cmp_cgroup_beneath_proto = {
+	.func		= bpf_landlock_cmp_cgroup_beneath,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+	.arg2_type	= ARG_CONST_PTR_TO_LANDLOCK_HANDLE_CGROUP,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+#endif	/* CONFIG_CGROUPS */
diff --git a/security/landlock/checker_cgroup.h b/security/landlock/checker_cgroup.h
new file mode 100644
index 000000000000..497cad7c2bb8
--- /dev/null
+++ b/security/landlock/checker_cgroup.h
@@ -0,0 +1,18 @@ 
+/*
+ * Landlock LSM - cgroup Checkers
+ *
+ * Copyright (C) 2016  Mickaël Salaün <mic@digikod.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifdef CONFIG_CGROUPS
+#ifndef _SECURITY_LANDLOCK_CHECKER_CGROUP_H
+#define _SECURITY_LANDLOCK_CHECKER_CGROUP_H
+
+extern const struct bpf_func_proto bpf_landlock_cmp_cgroup_beneath_proto;
+
+#endif /* _SECURITY_LANDLOCK_CHECKER_CGROUP_H */
+#endif /* CONFIG_CGROUPS */
diff --git a/security/landlock/lsm.c b/security/landlock/lsm.c
index 8645743243b6..cc4759f4e6c5 100644
--- a/security/landlock/lsm.c
+++ b/security/landlock/lsm.c
@@ -18,6 +18,10 @@ 
 
 #include "checker_fs.h"
 
+#ifdef CONFIG_CGROUPS
+#include "checker_cgroup.h"
+#endif /* CONFIG_CGROUPS */
+
 #define LANDLOCK_HOOK_INIT(NAME) LSM_HOOK_INIT(NAME, landlock_hook_##NAME)
 
 #define LANDLOCK_HOOKx(X, NAME, CNAME, ...)				\
@@ -124,6 +128,10 @@  static const struct bpf_func_proto *bpf_landlock_func_proto(
 		return &bpf_landlock_cmp_fs_prop_with_struct_file_proto;
 	case BPF_FUNC_landlock_cmp_fs_beneath_with_struct_file:
 		return &bpf_landlock_cmp_fs_beneath_with_struct_file_proto;
+	case BPF_FUNC_landlock_cmp_cgroup_beneath:
+#ifdef CONFIG_CGROUPS
+		return &bpf_landlock_cmp_cgroup_beneath_proto;
+#endif	/* CONFIG_CGROUPS */
 	default:
 		return NULL;
 	}